Exit walsender before confirming remote flush in logical replication

Started by Hayato Kuroda (Fujitsu) about 3 years ago, 71 messages
#1Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
1 attachment(s)

Dear hackers,
(I added Amit as CC because we discussed in another thread)

This is a fork thread from time-delayed logical replication [1].
While discussing, we thought that we could extend the condition of walsender shutdown [2][3].

Currently, walsenders delay the shutdown request until they confirm that all sent data
has been flushed on the remote side. This condition was added in 985bd7 [4] to
support clean switchover. Suppose that there is a primary-secondary
physical replication system and that we perform the following steps. If any changes arrive
during step 2 but the walsender has not confirmed the remote flush, the reboot in
step 3 may fail.

1. Stop the primary server.
2. Promote the secondary to the new primary.
3. Reboot the (old) primary as the new secondary.

In the case of logical replication, however, we cannot support the use case of
switching the publisher and subscriber roles. Suppose, in the same scenario as above, that additional
transactions are committed during step 2. To catch up with such changes, the subscriber
must receive the WAL related to those transactions, but that is not possible because the subscriber
cannot request WAL from a specific position. In that case, we must truncate all
data on the new subscriber once and then create a new subscription with copy_data
= true.

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

This change may be useful for time-delayed logical replication. The walsender
delays the shutdown until all changes are applied on the subscriber, even if the apply is
delayed. As a result, the publisher cannot be stopped if a large delay time is
specified.

PSA the minimal patch for that. I'm not sure whether WalSndCaughtUp should
also be omitted or not. It seems that the change may affect other parts such as
WalSndWaitForWal(), but we can investigate that further.

[1]: https://commitfest.postgresql.org/41/3581/
[2]: /messages/by-id/TYAPR01MB58661BA3BF38E9798E59AE14F5E19@TYAPR01MB5866.jpnprd01.prod.outlook.com
[3]: /messages/by-id/CAA4eK1LyetktcphdRrufHac4t5DGyhsS2xG2DSOGb7OaOVcDVg@mail.gmail.com
[4]: https://github.com/postgres/postgres/commit/985bd7d49726c9f178558491d31a570d47340459

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

0001-Exit-walsender-before-confirming-remote-flush-in-log.patch (application/octet-stream)
From cc444e339af93bf4a27ac644f5d65b0466b65126 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Thu, 22 Dec 2022 02:49:48 +0000
Subject: [PATCH] Exit walsender before confirming remote flush in logical
 replication

---
 src/backend/replication/walsender.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c11bb3716f..08d4a9861f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3099,8 +3099,9 @@ XLogSendLogical(void)
  * NB: This should only be called when the shutdown signal has been received
  * from postmaster.
  *
- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.
  */
 static void
 WalSndDone(WalSndSendDataCallback send_data)
@@ -3118,8 +3119,17 @@ WalSndDone(WalSndSendDataCallback send_data)
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	/*
+	 * Exit if we are in the convenient time.
+	 *
+	 * Note that in case of logical replication, we don't have to wait that all
+	 * sent data to be flushed on the subscriber. It will request to send WALs
+	 * from the last received point, and we cannot support clean switchover in
+	 * logical replication.
+	 */
+	if (WalSndCaughtUp &&
+		(send_data == XLogSendLogical ||
+		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
 	{
 		QueryCompletion qc;
 
-- 
2.27.0

#2Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#1)
Re: Exit walsender before confirming remote flush in logical replication

On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear hackers,
(I added Amit as CC because we discussed in another thread)

This is a fork thread from time-delayed logical replication [1].
While discussing, we thought that we could extend the condition of walsender shutdown[2][3].

Currently, walsenders delay the shutdown request until they confirm that all sent data
has been flushed on the remote side. This condition was added in 985bd7 [4] to
support clean switchover. Suppose that there is a primary-secondary
physical replication system and that we perform the following steps. If any changes arrive
during step 2 but the walsender has not confirmed the remote flush, the reboot in
step 3 may fail.

1. Stop the primary server.
2. Promote the secondary to the new primary.
3. Reboot the (old) primary as the new secondary.

In the case of logical replication, however, we cannot support the use case of
switching the publisher and subscriber roles. Suppose, in the same scenario as above, that additional
transactions are committed during step 2. To catch up with such changes, the subscriber
must receive the WAL related to those transactions, but that is not possible because the subscriber
cannot request WAL from a specific position. In that case, we must truncate all
data on the new subscriber once and then create a new subscription with copy_data
= true.

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

This change may be useful for time-delayed logical replication. The walsender
delays the shutdown until all changes are applied on the subscriber, even if the apply is
delayed. As a result, the publisher cannot be stopped if a large delay time is
specified.

I think the current behaviour is an artifact of using the same WAL
sender code for both logical and physical replication.

I agree with you that the logical WAL sender need not wait for all the
WAL to be replayed downstream.

I have not reviewed the patch though.

--
Best Wishes,
Ashutosh Bapat

#3Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Ashutosh Bapat (#2)
Re: Exit walsender before confirming remote flush in logical replication

At Thu, 22 Dec 2022 17:29:34 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in

On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In the case of logical replication, however, we cannot support the use case of
switching the publisher and subscriber roles. Suppose, in the same scenario as above, that additional

..

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

...

This change may be useful for time-delayed logical replication. The walsender
delays the shutdown until all changes are applied on the subscriber, even if the apply is
delayed. As a result, the publisher cannot be stopped if a large delay time is
specified.

I think the current behaviour is an artifact of using the same WAL
sender code for both logical and physical replication.

Yeah, I don't think we do that for the sake of switchover. On the
other hand, I think the behavior was intentionally carried over because it
was considered sensible on its own. And I'm afraid that many people already
rely on that behavior.

I agree with you that the logical WAL sender need not wait for all the
WAL to be replayed downstream.

Thus I feel that it might be a bit outrageous to get rid of that
behavior altogether because of a new feature stumbling on it. I'm
fine with doing that only in the apply_delay case, though. However, I have
another concern that we are introducing a second exception for
XLogSendLogical in the common path.

I doubt that anyone wants to use synchronous logical replication with
apply_delay, since the sender transaction is inevitably affected
by that delay.

So how about having the logrep worker send a kind of crafted feedback,
which reports commit_data.end_lsn as flushpos, before entering an
apply_delay? A small tweak is needed in send_feedback(), but it seems to
work.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#3)
Re: Exit walsender before confirming remote flush in logical replication

On Fri, Dec 23, 2022 at 7:51 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 22 Dec 2022 17:29:34 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in

On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In the case of logical replication, however, we cannot support the use case of
switching the publisher and subscriber roles. Suppose, in the same scenario as above, that additional

..

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

...

This change may be useful for time-delayed logical replication. The walsender
delays the shutdown until all changes are applied on the subscriber, even if the apply is
delayed. As a result, the publisher cannot be stopped if a large delay time is
specified.

I think the current behaviour is an artifact of using the same WAL
sender code for both logical and physical replication.

Yeah, I don't think we do that for the sake of switchover. On the
other hand, I think the behavior was intentionally carried over because it
was considered sensible on its own.

Do you see it was discussed somewhere? If so, can you please point to
that discussion?

And I'm afraid that many people already
rely on that behavior.

But OTOH, it can also be annoying for users to see some wait during
the shutdown which is actually not required.

I agree with you that the logical WAL sender need not wait for all the
WAL to be replayed downstream.

Thus I feel that it might be a bit outrageous to get rid of that
behavior altogether because of a new feature stumbling on it. I'm
fine with doing that only in the apply_delay case, though. However, I have
another concern that we are introducing a second exception for
XLogSendLogical in the common path.

I doubt that anyone wants to use synchronous logical replication with
apply_delay, since the sender transaction is inevitably affected
by that delay.

So how about having the logrep worker send a kind of crafted feedback,
which reports commit_data.end_lsn as flushpos, before entering an
apply_delay? A small tweak is needed in send_feedback(), but it seems to
work.

How can we send commit_data.end_lsn before actually committing the
xact? I think this can lead to a problem because the next time (say, after
a restart of the walsender) the server can skip sending the xact even if it
has not been committed by the client.

--
With Regards,
Amit Kapila.

#5Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Kyotaro Horiguchi (#3)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san,

So how about having the logrep worker send a kind of crafted feedback,
which reports commit_data.end_lsn as flushpos, before entering an
apply_delay? A small tweak is needed in send_feedback(), but it seems to
work.

Thanks for replying! I tested your suggestion, but it did not work well...

I made a PoC based on the latest time-delayed patches [1] for the non-streaming case.
Apply workers that are delaying the apply send begin_data.final_lsn as recvpos and flushpos in send_feedback().

The following is the content of the feedback message I got; we can see that recv and flush were overwritten.

```
DEBUG: sending feedback (force 1) to recv 0/1553638, write 0/1553550, flush 0/1553638
CONTEXT: processing remote data for replication origin "pg_16390" during message type "BEGIN" in transaction 730, finished at 0/1553638
```

On the walsender side, however, sentPtr was slightly larger than the flushed position reported by the subscriber.

```
(gdb) p MyWalSnd->sentPtr
$2 = 22361760
(gdb) p MyWalSnd->flush
$3 = 22361656
(gdb) p *MyWalSnd
$4 = {pid = 28807, state = WALSNDSTATE_STREAMING, sentPtr = 22361760, needreload = false, write = 22361656,
flush = 22361656, apply = 22361424, writeLag = 20020343, flushLag = 20020343, applyLag = 20020343,
sync_standby_priority = 0, mutex = 0 '\000', latch = 0x7ff0350cbb94, replyTime = 725113263592095}
```
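
As a reading aid for the two outputs above: the DEBUG line prints LSNs in the hi/lo hexadecimal form, while gdb prints the raw 64-bit integers. The following is a minimal standalone sketch for converting between the two, assuming the usual convention that an LSN is displayed as its upper and lower 32 bits in hex separated by a slash. The helper names `format_lsn` and `parse_lsn` are hypothetical and are not PostgreSQL functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a 64-bit WAL position into buf in the
 * "hi/lo" hexadecimal form used by the DEBUG message.
 */
static void
format_lsn(uint64_t lsn, char *buf, size_t buflen)
{
	assert(buf != NULL && buflen > 0);
	snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
			 (uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/*
 * Hypothetical helper: parse the "hi/lo" form back into a 64-bit value.
 * Returns 0 when the string does not have the expected shape.
 */
static uint64_t
parse_lsn(const char *str)
{
	unsigned int hi;
	unsigned int lo;

	if (sscanf(str, "%X/%X", &hi, &lo) != 2)
		return 0;
	return ((uint64_t) hi << 32) | (uint64_t) lo;
}
```

With the values from the gdb session, flush = 22361656 corresponds to 0/1553638, which is the flush position in the feedback message, while sentPtr = 22361760 corresponds to 0/15536A0, that is, 104 bytes ahead of it.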

Therefore, I could not shut down the publisher node while the apply was being delayed.
Do you have any opinions about this?

```
$ pg_ctl stop -D data_pub/
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
```

[1]: /messages/by-id/TYCPR01MB83730A3E21E921335F6EFA38EDE89@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#6Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#5)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san,

So how about having the logrep worker send a kind of crafted feedback,
which reports commit_data.end_lsn as flushpos, before entering an
apply_delay? A small tweak is needed in send_feedback(), but it seems to
work.

Thanks for replying! I tested your suggestion, but it did not work well...

I made a PoC based on the latest time-delayed patches [1] for the non-streaming case.
Apply workers that are delaying the apply send begin_data.final_lsn as recvpos
and flushpos in send_feedback().

Maybe I misunderstood what you said... I have also found that sentPtr is not the actual sent
position, but the starting point of the next WAL. You can see the comment below.

```
/*
 * How far have we sent WAL already? This is also advertised in
 * MyWalSnd->sentPtr. (Actually, this is the next WAL location to send.)
 */
static XLogRecPtr sentPtr = InvalidXLogRecPtr;
```

We must use end_lsn when crafting messages to cheat the walsender, but that value
is included in COMMIT, not in BEGIN, for the non-streaming case.
However, if workers are delayed in apply_handle_commit(), they will hold locks on database
objects for a long time, which causes another issue.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#7Dilip Kumar
dilipbalaut@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#1)
Re: Exit walsender before confirming remote flush in logical replication

On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In the case of logical replication, however, we cannot support the use case of
switching the publisher and subscriber roles. Suppose, in the same scenario as above, that additional
transactions are committed during step 2. To catch up with such changes, the subscriber
must receive the WAL related to those transactions, but that is not possible because the subscriber
cannot request WAL from a specific position. In that case, we must truncate all
data on the new subscriber once and then create a new subscription with copy_data
= true.

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

+1 for the idea.

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#8Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Dilip Kumar (#7)
1 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

Dear Dilip,

Thanks for checking my proposal!

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v2-0001-Exit-walsender-before-confirming-remote-flush-in-.patch (application/octet-stream)
From 8a28c255dd98c29fbdf60749d201a8e0010d507c Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Thu, 22 Dec 2022 02:49:48 +0000
Subject: [PATCH v2] Exit walsender before confirming remote flush in logical
 replication

Currently, at shutdown, walsender processes wait to send all pending data and
ensure the all data is flushed in remote node. This mechanism was added by
985bd7 for supporting clean switch over, but such use-case cannot be supported
for logical replication. This commit remove the blocking in the case.

Author: Hayato Kuroda
---
 src/backend/replication/walsender.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c11bb3716f..7cc60a7dd1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3099,8 +3099,9 @@ XLogSendLogical(void)
  * NB: This should only be called when the shutdown signal has been received
  * from postmaster.
  *
- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * physical replication mode and all WALs are not yet replicated, this function
+ * will return control to the caller.
  */
 static void
 WalSndDone(WalSndSendDataCallback send_data)
@@ -3118,8 +3119,17 @@ WalSndDone(WalSndSendDataCallback send_data)
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	/*
+	 * Exit if we are in the convenient time.
+	 *
+	 * Note that in case of logical replication, we don't have to wait that all
+	 * sent data to be flushed on the subscriber. It will request to send WALs
+	 * from the last received point, and we cannot support clean switchover in
+	 * logical replication.
+	 */
+	if (WalSndCaughtUp &&
+		(send_data == XLogSendLogical ||
+		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
 	{
 		QueryCompletion qc;
 
-- 
2.27.0

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#8)
Re: Exit walsender before confirming remote flush in logical replication

On Tue, Dec 27, 2022 at 1:44 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Thanks for checking my proposal!

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

I think even for logical replication we should check whether there is
any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
is the point to send the done message? Also, the caller of
WalSndDone() already has that check which is another reason why I
can't see why you didn't have the same check in function WalSndDone().

BTW, even after fixing this, I think logical replication will behave
differently when due to some reason (like time-delayed replication)
send buffer gets full and walsender is not able to send data. I think
this will be less of an issue with physical replication because there
is a separate walreceiver process to flush the WAL which doesn't wait
but the same is not true for logical replication. Do you have any
thoughts on this matter?

--
With Regards,
Amit Kapila.

#10Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#9)
Re: Exit walsender before confirming remote flush in logical replication

On Tue, Dec 27, 2022 at 2:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 27, 2022 at 1:44 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Thanks for checking my proposal!

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

I think even for logical replication we should check whether there is
any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
is the point to send the done message? Also, the caller of
WalSndDone() already has that check which is another reason why I
can't see why you didn't have the same check in function WalSndDone().

BTW, even after fixing this, I think logical replication will behave
differently when due to some reason (like time-delayed replication)
send buffer gets full and walsender is not able to send data. I think
this will be less of an issue with physical replication because there
is a separate walreceiver process to flush the WAL which doesn't wait
but the same is not true for logical replication. Do you have any
thoughts on this matter?

In logical replication, it can happen today as well without
time-delayed replication. Basically, say apply worker is waiting to
acquire some lock that is already acquired by some backend then it
will have the same behavior. I have not verified this, so you may want
to check it once.

--
With Regards,
Amit Kapila.

#11Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#9)
1 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit,

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

I think even for logical replication we should check whether there is
any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
is the point to send the done message? Also, the caller of
WalSndDone() already has that check which is another reason why I
can't see why you didn't have the same check in function WalSndDone().

I do not have a strong opinion here. Fixed.
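
To make the difference concrete, here is a minimal standalone sketch of the exit condition in WalSndDone() before the patch, in v2, and in v3. It is not part of the patch: the struct and function names are hypothetical stand-ins, and each field only models the walsender variable named in its comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for the state consulted at shutdown. */
typedef struct ShutdownState
{
	bool		caught_up;		/* models WalSndCaughtUp */
	bool		send_pending;	/* models pq_is_send_pending() */
	bool		is_logical;		/* models send_data == XLogSendLogical */
	uint64_t	sent_ptr;		/* models sentPtr */
	uint64_t	replicated_ptr; /* models replicatedPtr */
} ShutdownState;

/* Before the patch: always wait until the remote side has caught up. */
static bool
can_exit_original(const ShutdownState *s)
{
	return s->caught_up && s->sent_ptr == s->replicated_ptr &&
		!s->send_pending;
}

/* v2: a logical walsender skips both the position and the pending check. */
static bool
can_exit_v2(const ShutdownState *s)
{
	return s->caught_up &&
		(s->is_logical ||
		 (s->sent_ptr == s->replicated_ptr && !s->send_pending));
}

/* v3: every walsender drains its send buffer; only logical skips the flush wait. */
static bool
can_exit_v3(const ShutdownState *s)
{
	return s->caught_up && !s->send_pending &&
		(s->is_logical || s->sent_ptr == s->replicated_ptr);
}
```

In this model, a logical walsender whose subscriber lags behind can exit under v2 even while output is still queued, whereas under v3 it exits only once the queued output has been sent. A physical walsender behaves the same in all three variants.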

BTW, even after fixing this, I think logical replication will behave
differently when due to some reason (like time-delayed replication)
send buffer gets full and walsender is not able to send data. I think
this will be less of an issue with physical replication because there
is a separate walreceiver process to flush the WAL which doesn't wait
but the same is not true for logical replication. Do you have any
thoughts on this matter?

Yes, it may happen even if this work is done. And your analysis is correct that
the receive buffer is rarely full in physical replication because the walreceiver
flushes WAL immediately.
I think this is an architectural problem. Maybe we have assumed that the decoded
WAL is consumed in a short time. I do not have a good idea, but one approach is to
introduce a new process, a logical walreceiver. It would record the decoded WAL to
persistent storage, and workers would consume and then remove it. That may have a huge
impact on other features and should probably not be accepted...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v3-0001-Exit-walsender-before-confirming-remote-flush-in-.patch (application/octet-stream)
From 6adc81cfc436317c50f6b23faa9d5ac6a8ca42cd Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Thu, 22 Dec 2022 02:49:48 +0000
Subject: [PATCH v3] Exit walsender before confirming remote flush in logical
 replication

Currently, at shutdown, walsender processes wait to send all pending data and
ensure the all data is flushed in remote node. This mechanism was added by
985bd7 for supporting clean switch over, but such use-case cannot be supported
for logical replication. This commit remove the blocking in the case.

Author: Hayato Kuroda
---
 src/backend/replication/walsender.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c11bb3716f..b648beca75 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3099,8 +3099,9 @@ XLogSendLogical(void)
  * NB: This should only be called when the shutdown signal has been received
  * from postmaster.
  *
- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * physical replication mode and all WALs are not yet replicated, this function
+ * will return control to the caller.
  */
 static void
 WalSndDone(WalSndSendDataCallback send_data)
@@ -3118,8 +3119,16 @@ WalSndDone(WalSndSendDataCallback send_data)
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	/*
+	 * Exit if we are in the convenient time.
+	 *
+	 * Note that in case of logical replication, we don't have to wait that all
+	 * sent data to be flushed on the subscriber. It will request to send WALs
+	 * from the last received point, and we cannot support clean switchover in
+	 * logical replication.
+	 */
+	if (WalSndCaughtUp && !pq_is_send_pending() &&
+		(send_data == XLogSendLogical || sentPtr == replicatedPtr))
 	{
 		QueryCompletion qc;
 
-- 
2.27.0

#12Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#10)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit,

In logical replication, it can happen today as well without
time-delayed replication. Basically, say apply worker is waiting to
acquire some lock that is already acquired by some backend then it
will have the same behavior. I have not verified this, so you may want
to check it once.

Right, I could reproduce the scenario with the following steps.

1. Construct a pub -> sub logical replication system with streaming = off.
2. Define a table on both nodes.

```
CREATE TABLE tbl (id int PRIMARY KEY);
```

3. Execute concurrent transactions.

Tx-1 (on subscriber)
BEGIN;
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

Tx-2 (on publisher)
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

4. Try to shut down the publisher, but it will fail.

```
$ pg_ctl stop -D publisher
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#12)
Re: Exit walsender before confirming remote flush in logical replication

On Wed, Dec 28, 2022 at 8:19 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In logical replication, it can happen today as well without
time-delayed replication. Basically, say apply worker is waiting to
acquire some lock that is already acquired by some backend then it
will have the same behavior. I have not verified this, so you may want
to check it once.

Right, I could reproduce the scenario with the following steps.

1. Construct a pub -> sub logical replication system with streaming = off.
2. Define a table on both nodes.

```
CREATE TABLE tbl (id int PRIMARY KEY);
```

3. Execute concurrent transactions.

Tx-1 (on subscriber)
BEGIN;
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

Tx-2 (on publisher)
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

4. Try to shut down the publisher, but it will fail.

```
$ pg_ctl stop -D publisher
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
```

Thanks for the verification. BTW, do you think we should document this
either with time-delayed replication or otherwise unless this is
already documented?

Another thing we can investigate here is why we need to ensure that
there is no pending send before shutdown.

--
With Regards,
Amit Kapila.

#14Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#13)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit,

Thanks for the verification. BTW, do you think we should document this
either with time-delayed replication or otherwise unless this is
already documented?

I think this should be documented in the "Shutting Down the Server" section of runtime.sgml
or in logical-replication.sgml, but I could not find such a description. It will be included in the next version.

Another thing we can investigate here is why we need to ensure that
there is no pending send before shutdown.

I have not looked into it yet; I will continue next year.
It seems that walsenders have been sending all data before shutting down since ea5516,
e0b581 and 754baa.
There were many threads related to streaming replication, so I could not pin down
the specific message mentioned in the commit message of ea5516.

I have also checked some wiki pages [1][2], but I could not find any design about it.

[1]: https://wiki.postgresql.org/wiki/Streaming_Replication
[2]: https://wiki.postgresql.org/wiki/Synchronous_Replication_9/2010_Proposal

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#11)
Re: Exit walsender before confirming remote flush in logical replication

On Wed, Dec 28, 2022 at 8:18 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Amit,

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

I think even for logical replication we should check whether there is
any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
is the point to send the done message? Also, the caller of
WalSndDone() already has that check which is another reason why I
can't see why you didn't have the same check in function WalSndDone().

I do not have a strong opinion here. Fixed.

BTW, even after fixing this, I think logical replication will behave
differently when due to some reason (like time-delayed replication)
send buffer gets full and walsender is not able to send data. I think
this will be less of an issue with physical replication because there
is a separate walreceiver process to flush the WAL which doesn't wait
but the same is not true for logical replication. Do you have any
thoughts on this matter?

Yes, it may happen even if this work is done. And your analysis is correct that
the receive buffer is rarely full in physical replication because walreceiver
immediately flush WALs.

Okay, but what happens in the case of physical replication when
synchronous_commit = remote_apply? In that case, won't it ensure that
apply has also happened? If so, then shouldn't the time delay feature
also cause a similar problem for physical replication as well?

--
With Regards,
Amit Kapila.

#16Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#14)
Re: Exit walsender before confirming remote flush in logical replication

At Wed, 28 Dec 2022 09:15:41 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in

Another thing we can investigate here why do we need to ensure that
there is no pending send before shutdown.

I have not done yet about it, will continue next year.
It seems that walsenders have been sending all data before shutting down since ea5516,
e0b581 and 754baa.
There were many threads related with streaming replication, so I could not pin
the specific message that written in the commit message of ea5516.

I have also checked some wiki pages [1][2], but I could not find any design about it.

[1]: https://wiki.postgresql.org/wiki/Streaming_Replication
[2]: https://wiki.postgresql.org/wiki/Synchronous_Replication_9/2010_Proposal

If I'm grasping the discussion here correctly, as far as I remember it
is because physical replication needs all records that have been
written on the primary to be written on the standby for a switchover to
succeed. It is annoying that a normal shutdown occasionally leads to
switchover failure. Thus WalSndDone explicitly waits for remote
flush/write regardless of the setting of synchronous_commit. Thus apply
delay doesn't affect shutdown (AFAICS), and that is sufficient since all
the records will be applied at the next startup.
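
As a rough illustration of the point above (a standalone sketch, not
code from walsender.c), the shutdown check can be modeled as comparing
the sent position with the standby's flush position, falling back to its
write position when no flush position has been reported. The
StandbyReply struct and both helper names are hypothetical; only the
flush-or-write fallback mirrors what WalSndDone() does. The apply
position is carried along but never consulted, which is why an apply
delay alone does not hold up shutdown in this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the server's LSN type */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical snapshot of the positions a standby reports back. */
typedef struct StandbyReply
{
	XLogRecPtr	write;
	XLogRecPtr	flush;
	XLogRecPtr	apply;		/* reported, but not consulted at shutdown */
} StandbyReply;

/* Prefer the flush position; fall back to write when flush is unknown. */
static XLogRecPtr
replicated_ptr(const StandbyReply *reply)
{
	assert(reply != NULL);
	return reply->flush == InvalidXLogRecPtr ? reply->write : reply->flush;
}

/* True when everything sent has been written/flushed, whatever apply says. */
static bool
standby_has_everything(XLogRecPtr sent, const StandbyReply *reply)
{
	return sent == replicated_ptr(reply);
}
```

With write = flush = 100 and apply = 40, standby_has_everything(100, ...)
is true in this model even though most of the WAL is still unapplied.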

In logical replication, apply precedes write and flush, so we have no
indication of whether a record has been "replicated" to the standby
other than the apply LSN. On the other hand, logical replication has no
business with switchover, so that assurance is useless. Thus I think
we can (practically) ignore apply_lsn at shutdown. It seems subtly
irregular, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#17Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#15)
Re: Exit walsender before confirming remote flush in logical replication

At Fri, 13 Jan 2023 16:41:08 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

Okay, but what happens in the case of physical replication when
synchronous_commit = remote_apply? In that case, won't it ensure that
apply has also happened? If so, then shouldn't the time delay feature
also cause a similar problem for physical replication as well?

As written in another mail, WalSndDone doesn't honor
synchronous_commit. In other words, AFAICS the walsender finishes
without waiting for remote_apply. The unapplied records will be applied
at the next startup.

I haven't confirmed that behavior myself, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#18Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Kyotaro Horiguchi (#17)
1 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san, Amit,

At Fri, 13 Jan 2023 16:41:08 +0530, Amit Kapila <amit.kapila16@gmail.com>
wrote in

Okay, but what happens in the case of physical replication when
synchronous_commit = remote_apply? In that case, won't it ensure that
apply has also happened? If so, then shouldn't the time delay feature
also cause a similar problem for physical replication as well?

As written in another mail, WalSndDone doesn't honor
synchronous_commit. In other words, AFAICS the walsender finishes
without waiting for remote_apply. The unapplied records will be applied
at the next startup.

I haven't confirmed that behavior myself, though.

If Amit was referring to the case where data to be sent is pending in physical
replication, then the walsender cannot stop. But this is unrelated to
synchronous_commit: it happens because the walsender must send out all pending
data before shutting down. We can reproduce the situation as follows:

1. build streaming replication system
2. kill -STOP $walreceiver
3. insert data to primary server
4. try to stop the primary server

If what you said was not related to pending data, the walsender can be stopped even
if synchronous_commit = remote_apply. As Horiguchi-san said, such a condition
is not written in WalSndDone() [1]. I think the parameter synchronous_commit does
not affect the walsender process much. It just defines when the backend returns the
result to the client.

I confirmed this with the following steps:

1. Built a streaming replication system. PSA the script that does this.

Primary config.

```
synchronous_commit = 'remote_apply'
synchronous_standby_names = 'secondary'
```

Secondary config.

```
recovery_min_apply_delay = 1d
primary_conninfo = 'user=postgres port=$port_N1 application_name=secondary'
hot_standby = on
```

2. Inserted data into the primary. This waited for the remote apply.

psql -U postgres -p $port_primary -c "INSERT INTO tbl SELECT generate_series(1, 5000)"

3. Stopped the primary server from another terminal. The shutdown succeeded.
The terminal from step 2 printed:

```
WARNING: canceling the wait for synchronous replication and terminating connection due to administrator command
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
```

[1]: https://github.com/postgres/postgres/blob/master/src/backend/replication/walsender.c#L3121

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

test_phy.sh (application/octet-stream)

#19Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Kyotaro Horiguchi (#16)
1 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san,

If I'm grabbing the discussion here correctly, in my memory, it is
because: physical replication needs all records that have written on
primary are written on standby for switchover to succeed. It is
annoying that normal shutdown occasionally leads to switchover
failure. Thus WalSndDone explicitly waits for remote flush/write
regardless of the setting of synchronous_commit.

AFAIK the condition (sentPtr == replicatedPtr) seems to have been introduced for that purpose [1].
You meant to say that the condition (!pq_is_send_pending()) has the same motivation, right?

Thus apply delay
doesn't affect shutdown (AFAICS), and that is sufficient since all the
records will be applied at the next startup.

I was not sure what "next startup" meant, but I agree that we can shut down the
walsender in case of recovery_min_apply_delay > 0 and synchronous_commit = remote_apply.
The startup process will not be terminated even if the primary crashes, so I
think the process will apply the transaction sooner or later.

In logical replication, apply precedes write and flush, so we have no
indication of whether a record has been "replicated" to the standby
other than the apply LSN. On the other hand, logical replication has no
business with switchover, so that assurance is useless. Thus I think
we can (practically) ignore apply_lsn at shutdown. It seems subtly
irregular, though.

Another consideration is that the condition (!pq_is_send_pending()) ensures that
there are no pending messages, including other packets. Currently we force walsenders
to send out all messages before shutting down, even if the pending one is only a keepalive.
I cannot think of any problems caused by this, but I can keep the condition for
logical replication.

I updated the patch accordingly. Also, I found that the previous version
did not work well in case of streamed transactions. When a streamed transaction
is committed on the publisher but its apply is delayed on the subscriber, the
process sometimes waits until there is no pending write. This is done in
ProcessPendingWrites(). I added another termination path to that function.

[1]: https://github.com/postgres/postgres/commit/985bd7d49726c9f178558491d31a570d47340459

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v4-0001-Exit-walsender-before-confirming-remote-flush-in-.patch (application/octet-stream)
From cb861c1de3c8cd70b7cb2fe47711ef36fbd16bd2 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Thu, 22 Dec 2022 02:49:48 +0000
Subject: [PATCH v4] Exit walsender before confirming remote flush in logical
 replication

Currently, at shutdown, walsender processes wait to send all pending data and
ensure the all data is flushed in remote node. This mechanism was added by
985bd7 for supporting clean switch over, but such use-case cannot be supported
for logical replication. This commit remove the blocking in the case.

Author: Hayato Kuroda
---
 doc/src/sgml/logical-replication.sgml | 10 ++++++
 src/backend/replication/walsender.c   | 45 +++++++++++++++++----------
 2 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 54f48be87f..403c518b51 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -1694,6 +1694,16 @@ CONTEXT:  processing remote data for replication origin "pg_16395" during "INSER
    table is in progress, there will be additional workers for the tables
    being synchronized.
   </para>
+
+  <caution>
+   <para>
+    Unlike physical replication, data synchronization by logical replication is
+    more likely to be suspended. It is because workers sometimes wait for
+    acquiring locks and they do not consume messages from the publisher. It
+    will be resolved automatically when workers acquire locks and start
+    consuming arrivals.
+   </para>
+  </caution>
  </sect1>
 
  <sect1 id="logical-replication-security">
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 015ae2995d..dbca93dd9d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1450,6 +1450,10 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		/* If we got shut down requested, try to exit the process */
+		if (got_STOPPING)
+			WalSndDone(XLogSendLogical);
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
@@ -2513,18 +2517,14 @@ WalSndLoop(WalSndSendDataCallback send_data)
 										 application_name)));
 				WalSndSetState(WALSNDSTATE_STREAMING);
 			}
-
-			/*
-			 * When SIGUSR2 arrives, we send any outstanding logs up to the
-			 * shutdown checkpoint record (i.e., the latest record), wait for
-			 * them to be replicated to the standby, and exit. This may be a
-			 * normal termination at shutdown, or a promotion, the walsender
-			 * is not sure which.
-			 */
-			if (got_SIGUSR2)
-				WalSndDone(send_data);
 		}
 
+		/*
+		 * When SIGUSR2 arrives, try to exit the process.
+		 */
+		if (got_SIGUSR2)
+			WalSndDone(send_data);
+
 		/* Check for replication timeout. */
 		WalSndCheckTimeOut();
 
@@ -3094,13 +3094,14 @@ XLogSendLogical(void)
 }
 
 /*
- * Shutdown if the sender is caught up.
+ * Shutdown if the sender is we are in a convenient time.
  *
  * NB: This should only be called when the shutdown signal has been received
  * from postmaster.
  *
- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * physical replication mode and all WALs are not yet replicated, this function
+ * will return control to the caller.
  */
 static void
 WalSndDone(WalSndSendDataCallback send_data)
@@ -3118,15 +3119,27 @@ WalSndDone(WalSndSendDataCallback send_data)
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	/*
+	 * Exit if we are in the convenient time.
+	 *
+	 * Note that in case of logical replication, we don't have to wait that all
+	 * sent data to be flushed on the subscriber. It will request to send WALs
+	 * from the last received point, and we cannot support clean switchover in
+	 * logical replication.
+	 */
+	if (WalSndCaughtUp &&
+		(send_data == XLogSendLogical ||
+		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
 	{
 		QueryCompletion qc;
 
 		/* Inform the standby that XLOG streaming is done */
 		SetQueryCompletion(&qc, CMDTAG_COPY, 0);
 		EndCommand(&qc, DestRemote, false);
-		pq_flush();
+		if (send_data == XLogSendLogical)
+			pq_flush_if_writable();
+		else
+			pq_flush();
 
 		proc_exit(0);
 	}
-- 
2.27.0

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#19)
Re: Exit walsender before confirming remote flush in logical replication

On Mon, Jan 16, 2023 at 4:39 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In logical replication, apply precedes write and flush, so we have no
indication of whether a record has been "replicated" to the standby
other than the apply LSN. On the other hand, logical replication has no
business with switchover, so that assurance is useless. Thus I think
we can (practically) ignore apply_lsn at shutdown. It seems subtly
irregular, though.

Another consideration is that the condition (!pq_is_send_pending()) ensures that
there are no pending messages, including other packets. Currently we force walsenders
to clean up all messages before shutting down, even if it is a keepalive one.
I cannot have any problems caused by this, but I can keep the condition in case of
logical replication.

Let me try to summarize the discussion till now. The problem we are
trying to solve here is to allow a shutdown to complete when walsender
is not able to send the entire WAL. Currently, in such cases, the
shutdown fails. As per our current understanding, this can happen when
(a) walreceiver/walapply process is stuck (not able to receive more
WAL) due to locks or some other reason; (b) a long time delay has been
configured to apply the WAL (we don't yet have such a feature for
logical replication but the discussion for same is in progress).

Both reasons mostly apply to logical replication because there is no
separate walreceiver process whose job is to just flush the WAL. In
logical replication, the process that receives the WAL also applies
it. So, while applying, it can get stuck for a long time waiting for
some heavy-weight lock to be released by some other long-running
transaction in a backend. Similarly, if the user has configured a
large value for time-delayed apply, the network buffer between the
walsender and the receiving/apply process can become full.

The condition to allow the shutdown to wait for all WAL to be sent has
two parts: (a) it confirms that there is no pending WAL to be sent;
(b) it confirms all the WAL sent has been flushed by the client. As
per our understanding, both these conditions are to allow clean
switchover/failover which seems to be useful only for physical
replication. The logical replication doesn't provide such
functionality.
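
The two parts of that condition can be sketched as a small standalone
predicate. This is only a simplified model under assumptions, not code
from the tree: SenderState, ShutdownBlocker and shutdown_blocker() are
hypothetical names, and the fields merely stand in for WalSndCaughtUp,
pq_is_send_pending(), sentPtr and replicatedPtr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the server's LSN type */

/* Hypothetical snapshot of what the walsender knows at shutdown time. */
typedef struct SenderState
{
	bool		caught_up;		/* models WalSndCaughtUp */
	bool		send_pending;	/* models pq_is_send_pending() */
	XLogRecPtr	sent;			/* models sentPtr */
	XLogRecPtr	replicated;		/* models replicatedPtr */
} SenderState;

typedef enum ShutdownBlocker
{
	BLOCK_NONE,				/* the walsender may exit */
	BLOCK_NOT_CAUGHT_UP,	/* there is still WAL left to read and send */
	BLOCK_SEND_PENDING,		/* part (a): output buffer not drained */
	BLOCK_NOT_FLUSHED		/* part (b): receiver has not confirmed flush */
} ShutdownBlocker;

/* Report the first reason, if any, that keeps the walsender from exiting. */
static ShutdownBlocker
shutdown_blocker(const SenderState *st)
{
	assert(st != NULL);

	if (!st->caught_up)
		return BLOCK_NOT_CAUGHT_UP;
	if (st->send_pending)
		return BLOCK_SEND_PENDING;
	if (st->sent != st->replicated)
		return BLOCK_NOT_FLUSHED;
	return BLOCK_NONE;
}
```

Naming the blocker separately for (a) and (b) makes it easier to see
which part a given change to the shutdown condition would relax.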

The proposed patch tries to eliminate condition (b) for logical
replication in the hopes that the same will allow the shutdown to be
complete in most cases. There is no specific reason discussed to not
do (a) for logical replication.

Now, to proceed here we have the following options: (1) Fix (b) as
proposed by the patch and document the risks related to (a); (2) Fix
both (a) and (b); (3) Do nothing and document that users need to
unblock the subscribers to complete the shutdown.

Thoughts?

--
With Regards,
Amit Kapila.

#21Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#20)
1 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit, hackers,

Let me try to summarize the discussion till now. The problem we are
trying to solve here is to allow a shutdown to complete when walsender
is not able to send the entire WAL. Currently, in such cases, the
shutdown fails. As per our current understanding, this can happen when
(a) walreceiver/walapply process is stuck (not able to receive more
WAL) due to locks or some other reason; (b) a long time delay has been
configured to apply the WAL (we don't yet have such a feature for
logical replication but the discussion for same is in progress).

Thanks for summarizing.
While analyzing the hang, I noticed that there are two types of shutdown failures.
They can be characterized by their backtraces, which are shown at the bottom.

Type i)
The walsender executes WalSndDone(), but cannot satisfy the condition.
It means that all WAL has been sent to the subscriber but has not been flushed;
sentPtr is not the same as replicatedPtr. This hang can happen when the delayed
transaction is small or streamed.

Type ii)
The walsender cannot execute WalSndDone() and gets stuck in ProcessPendingWrites().
It means that the send buffer became full while replicating a transaction;
pq_is_send_pending() returns true and the walsender cannot break out of the loop.
This hang can happen when the delayed transaction is large and is not a streamed one.

If we choose modification (1), we can only fix type (i), because in type (ii) the pending
WAL still causes the failure. IIUC if we want to shut down walsender processes even in
case (ii), we must choose (2), and additional fixes are needed.
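
To make the mapping between the two failure types and the proposed
modifications concrete, here is a simplified standalone model. It is an
illustration only and not part of the patch: SenderState, StuckType,
Modification and both functions are hypothetical names, and the model
ignores where in the code the walsender is actually waiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the server's LSN type */

/* Hypothetical snapshot of a logical walsender at shutdown time. */
typedef struct SenderState
{
	bool		send_pending;	/* models pq_is_send_pending() */
	XLogRecPtr	sent;			/* models sentPtr */
	XLogRecPtr	replicated;		/* models replicatedPtr */
} SenderState;

typedef enum StuckType
{
	STUCK_NONE,
	STUCK_TYPE_I,			/* everything sent, but not yet flushed */
	STUCK_TYPE_II			/* send buffer full, writes still pending */
} StuckType;

/* Options (1)-(3) from the summary: drop (b) only, drop both, or neither. */
typedef enum Modification
{
	MOD_1 = 1,
	MOD_2 = 2,
	MOD_3 = 3
} Modification;

static StuckType
classify_stuck(const SenderState *st)
{
	assert(st != NULL);

	if (st->send_pending)
		return STUCK_TYPE_II;
	if (st->sent != st->replicated)
		return STUCK_TYPE_I;
	return STUCK_NONE;
}

/* Would shutdown complete for this state under the given modification? */
static bool
logical_shutdown_completes(const SenderState *st, Modification mod)
{
	switch (classify_stuck(st))
	{
		case STUCK_TYPE_I:
			return mod == MOD_1 || mod == MOD_2;
		case STUCK_TYPE_II:
			return mod == MOD_2;
		case STUCK_NONE:
			break;
	}
	return true;
}
```

In this model a type (ii) state is rescued only by MOD_2, which matches
the reasoning above.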

Based on the above, I prefer modification (2) because it can rescue more cases. Thoughts?
PSA the patch for it. It is almost the same as the previous version, but the comments are updated.

Appendix:

The backtrace for type i)

```
#0 WalSndDone (send_data=0x87f825 <XLogSendLogical>) at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:3111
#1 0x000000000087ed1d in WalSndLoop (send_data=0x87f825 <XLogSendLogical>) at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:2525
#2 0x000000000087d40a in StartLogicalReplication (cmd=0x1f49030) at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:1320
#3 0x000000000087df29 in exec_replication_command (
cmd_string=0x1f15498 "START_REPLICATION SLOT \"sub\" LOGICAL 0/0 (proto_version '4', streaming 'on', origin 'none', publication_names '\"pub\"')")
at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:1830
#4 0x000000000091b032 in PostgresMain (dbname=0x1f4c938 "postgres", username=0x1f4c918 "postgres")
at ../../PostgreSQL-Source-Dev/src/backend/tcop/postgres.c:4561
#5 0x000000000085390b in BackendRun (port=0x1f3d0b0) at ../../PostgreSQL-Source-Dev/src/backend/postmaster/postmaster.c:4437
#6 0x000000000085322c in BackendStartup (port=0x1f3d0b0) at ../../PostgreSQL-Source-Dev/src/backend/postmaster/postmaster.c:4165
#7 0x000000000084f7a2 in ServerLoop () at ../../PostgreSQL-Source-Dev/src/backend/postmaster/postmaster.c:1762
#8 0x000000000084f0a2 in PostmasterMain (argc=3, argv=0x1f0ff30) at ../../PostgreSQL-Source-Dev/src/backend/postmaster/postmaster.c:1452
#9 0x000000000074a4d6 in main (argc=3, argv=0x1f0ff30) at ../../PostgreSQL-Source-Dev/src/backend/main/main.c:200
```

The backtrace for type ii)

```
#0 ProcessPendingWrites () at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:1438
#1 0x000000000087d635 in WalSndWriteData (ctx=0x1429ce8, lsn=22406440, xid=731, last_write=true)
at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:1405
#2 0x0000000000888420 in OutputPluginWrite (ctx=0x1429ce8, last_write=true) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/logical.c:669
#3 0x00007f022dfe43a7 in pgoutput_change (ctx=0x1429ce8, txn=0x1457d40, relation=0x7f0245075268, change=0x1460ef8)
at ../../PostgreSQL-Source-Dev/src/backend/replication/pgoutput/pgoutput.c:1491
#4 0x0000000000889125 in change_cb_wrapper (cache=0x142bcf8, txn=0x1457d40, relation=0x7f0245075268, change=0x1460ef8)
at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/logical.c:1077
#5 0x000000000089507c in ReorderBufferApplyChange (rb=0x142bcf8, txn=0x1457d40, relation=0x7f0245075268, change=0x1460ef8, streaming=false)
at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/reorderbuffer.c:1969
#6 0x0000000000895866 in ReorderBufferProcessTXN (rb=0x142bcf8, txn=0x1457d40, commit_lsn=23060624, snapshot_now=0x1440150, command_id=0, streaming=false)
at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/reorderbuffer.c:2245
#7 0x0000000000896348 in ReorderBufferReplay (txn=0x1457d40, rb=0x142bcf8, xid=731, commit_lsn=23060624, end_lsn=23060672, commit_time=727353664342177,
origin_id=0, origin_lsn=0) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/reorderbuffer.c:2675
#8 0x00000000008963d0 in ReorderBufferCommit (rb=0x142bcf8, xid=731, commit_lsn=23060624, end_lsn=23060672, commit_time=727353664342177, origin_id=0,
origin_lsn=0) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/reorderbuffer.c:2699
#9 0x00000000008842c7 in DecodeCommit (ctx=0x1429ce8, buf=0x7ffcf03731a0, parsed=0x7ffcf0372fa0, xid=731, two_phase=false)
at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/decode.c:682
#10 0x0000000000883667 in xact_decode (ctx=0x1429ce8, buf=0x7ffcf03731a0) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/decode.c:216
#11 0x000000000088338b in LogicalDecodingProcessRecord (ctx=0x1429ce8, record=0x142a080)
at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/decode.c:119
#12 0x000000000087f8c7 in XLogSendLogical () at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:3060
#13 0x000000000087ec5a in WalSndLoop (send_data=0x87f825 <XLogSendLogical>) at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:2490
...
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v5-0001-Exit-walsender-before-confirming-remote-flush-in-.patch (application/octet-stream)
From 40340027c4ed5397a670eb3623cbfc9b1235c848 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Thu, 22 Dec 2022 02:49:48 +0000
Subject: [PATCH v5] Exit walsender before confirming remote flush in logical
 replication

Currently, at shutdown, walsender processes wait to send all pending data and
ensure the all data is flushed in remote node. This mechanism was added by
985bd7 for supporting clean switch over, but such use-case cannot be supported
for logical replication. This commit remove the blocking in the case.

Author: Hayato Kuroda
---
 doc/src/sgml/logical-replication.sgml | 10 ++++++
 src/backend/replication/walsender.c   | 50 ++++++++++++++++++---------
 2 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 6407804547..88b9a63f30 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -1701,6 +1701,16 @@ CONTEXT:  processing remote data for replication origin "pg_16395" during "INSER
    table is in progress, there will be additional workers for the tables
    being synchronized.
   </para>
+
+  <caution>
+   <para>
+    Unlike physical replication, data synchronization by logical replication is
+    more likely to be suspended. It is because workers sometimes wait for
+    acquiring locks and they do not consume messages from the publisher. It
+    will be resolved automatically when workers acquire locks and start
+    consuming arrivals.
+   </para>
+  </caution>
  </sect1>
 
  <sect1 id="logical-replication-security">
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 015ae2995d..0179eb7142 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1450,6 +1450,10 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		/* If we got shut down requested, try to exit the process */
+		if (got_STOPPING)
+			WalSndDone(XLogSendLogical);
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
@@ -2513,18 +2517,14 @@ WalSndLoop(WalSndSendDataCallback send_data)
 										 application_name)));
 				WalSndSetState(WALSNDSTATE_STREAMING);
 			}
-
-			/*
-			 * When SIGUSR2 arrives, we send any outstanding logs up to the
-			 * shutdown checkpoint record (i.e., the latest record), wait for
-			 * them to be replicated to the standby, and exit. This may be a
-			 * normal termination at shutdown, or a promotion, the walsender
-			 * is not sure which.
-			 */
-			if (got_SIGUSR2)
-				WalSndDone(send_data);
 		}
 
+		/*
+		 * When SIGUSR2 arrives, try to exit the process.
+		 */
+		if (got_SIGUSR2)
+			WalSndDone(send_data);
+
 		/* Check for replication timeout. */
 		WalSndCheckTimeOut();
 
@@ -3094,13 +3094,14 @@ XLogSendLogical(void)
 }
 
 /*
- * Shutdown if the sender is caught up.
+ * Shutdown if the sender is we are in a convenient time.
  *
  * NB: This should only be called when the shutdown signal has been received
  * from postmaster.
  *
- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * physical replication mode and all WALs are not yet replicated, this function
+ * will return control to the caller.
  */
 static void
 WalSndDone(WalSndSendDataCallback send_data)
@@ -3118,15 +3119,32 @@ WalSndDone(WalSndSendDataCallback send_data)
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	/*
+	 * Exit if we are in the convenient time.
+	 *
+	 * When we are logical replication mode, we don't have to wait that all
+	 * sent data to be flushed on the subscriber because we cannot support
+	 * clean switchover for it.
+	 */
+	if (WalSndCaughtUp &&
+		(send_data == XLogSendLogical ||
+		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
 	{
 		QueryCompletion qc;
 
 		/* Inform the standby that XLOG streaming is done */
 		SetQueryCompletion(&qc, CMDTAG_COPY, 0);
 		EndCommand(&qc, DestRemote, false);
-		pq_flush();
+
+		/*
+		 * Flush pending data if writable.
+		 *
+		 * Note that the output buffer may be full in case of logical
+		 * replication. If pq_flush() is called at that time, the walsender
+		 * process will be stuck. Therefore, call pq_flush_if_writable()
+		 * instead.
+		 */
+		pq_flush_if_writable();
 
 		proc_exit(0);
 	}
-- 
2.27.0

#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#20)
Re: Exit walsender before confirming remote flush in logical replication

On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me try to summarize the discussion till now. The problem we are
trying to solve here is to allow a shutdown to complete when walsender
is not able to send the entire WAL. Currently, in such cases, the
shutdown fails. As per our current understanding, this can happen when
(a) walreceiver/walapply process is stuck (not able to receive more
WAL) due to locks or some other reason; (b) a long time delay has been
configured to apply the WAL (we don't yet have such a feature for
logical replication but the discussion for same is in progress).

Both reasons mostly apply to logical replication because there is no
separate walreceiver process whose job is to just flush the WAL. In
logical replication, the process that receives the WAL also applies
it. So, while applying it can stuck for a long time waiting for some
heavy-weight lock to be released by some other long-running
transaction by the backend.

While checking the commits and email discussions in this area, I came
across the email [1] from Michael where something similar seems to
have been discussed. Basically, the question was whether the early
shutdown of walsender can prevent a switchover between publisher and
subscriber, but that part was never clearly answered in that email
chain. It might be worth reading the entire discussion [2]. That
discussion finally led to the following commit:

commit c6c333436491a292d56044ed6e167e2bdee015a2
Author: Andres Freund <andres@anarazel.de>
Date: Mon Jun 5 18:53:41 2017 -0700

Prevent possibility of panics during shutdown checkpoint.

When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and
throws a PANIC if so. At that point, only walsenders are still
active, so one might think this could not happen, but walsenders can
also generate WAL, for instance in BASE_BACKUP and logical decoding
related commands (e.g. via hint bits). So they can trigger this panic
if such a command is run while the shutdown checkpoint is being
written.

To fix this, divide the walsender shutdown into two phases. First,
checkpointer, itself triggered by postmaster, sends a
PROCSIG_WALSND_INIT_STOPPING signal to all walsenders. If the backend
is idle or runs an SQL query this causes the backend to shutdown, if
logical replication is in progress all existing WAL records are
processed followed by a shutdown.
...
...

Here, as mentioned in the commit, we are trying to ensure that before
the checkpointer writes its shutdown WAL record, "if logical
replication is in progress all existing WAL records are processed
followed by a shutdown.". I think that even before this commit we tried
to send the entire WAL before shutdown, but I am not completely sure.
There was no discussion on what happens if the logical
walreceiver/walapply process is waiting on some heavy-weight lock and
the network socket buffer is full, due to which walsender is not able to
process its WAL. Is it okay for shutdown to fail in such a case, as
happens now, or shall we somehow detect that and shut down the
walsender, or shall we just allow the logical walsender to always exit
immediately as soon as the shutdown signal arrives?

Note: I have added some of the people involved in the previous
thread's [2] discussion in the hope that they can share their
thoughts.

[1]: /messages/by-id/CAB7nPqR3icaA=qMv_FuU8YVYH3KUrNMnq_OmCfkzxCHC4fox8w@mail.gmail.com
[2]: /messages/by-id/CAHGQGwEsttg9P9LOOavoc9d6VB1zVmYgfBk=Ljsk-UL9cEf-eA@mail.gmail.com

--
With Regards,
Amit Kapila.

#23Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#22)
Re: Exit walsender before confirming remote flush in logical replication

On Fri, Jan 20, 2023 at 4:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me try to summarize the discussion till now. The problem we are
trying to solve here is to allow a shutdown to complete when walsender
is not able to send the entire WAL. Currently, in such cases, the
shutdown fails. As per our current understanding, this can happen when
(a) walreceiver/walapply process is stuck (not able to receive more
WAL) due to locks or some other reason; (b) a long time delay has been
configured to apply the WAL (we don't yet have such a feature for
logical replication but the discussion for same is in progress).

Both reasons mostly apply to logical replication because there is no
separate walreceiver process whose job is to just flush the WAL. In
logical replication, the process that receives the WAL also applies
it. So, while applying it can stuck for a long time waiting for some
heavy-weight lock to be released by some other long-running
transaction by the backend.

While checking the commits and email discussions in this area, I came
across the email [1] from Michael where something similar seems to
have been discussed. Basically, whether the early shutdown of
walsender can prevent a switchover between publisher and subscriber
but that part was never clearly answered in that email chain. It might
be worth reading the entire discussion [2]. That discussion finally
lead to the following commit:

Right, in the thread the question is raised about whether it makes
sense for logical replication to send all WALs but there is no
conclusion on that. But I think this patch is mainly about resolving
the PANIC due to extra WAL getting generated by walsender during
checkpoint processing and that's the reason the behavior of sending
all the WAL is maintained but only the extra WAL generation stopped
(before shutdown checkpoint can proceed) using this new state

commit c6c333436491a292d56044ed6e167e2bdee015a2
Author: Andres Freund <andres@anarazel.de>
Date: Mon Jun 5 18:53:41 2017 -0700

Prevent possibility of panics during shutdown checkpoint.

When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and
throws a PANIC if so. At that point, only walsenders are still
active, so one might think this could not happen, but walsenders can
also generate WAL, for instance in BASE_BACKUP and logical decoding
related commands (e.g. via hint bits). So they can trigger this panic
if such a command is run while the shutdown checkpoint is being
written.

To fix this, divide the walsender shutdown into two phases. First,
checkpointer, itself triggered by postmaster, sends a
PROCSIG_WALSND_INIT_STOPPING signal to all walsenders. If the backend
is idle or runs an SQL query this causes the backend to shutdown, if
logical replication is in progress all existing WAL records are
processed followed by a shutdown.
...
...

Here, as mentioned in the commit, we are trying to ensure that before
checkpoint writes its shutdown WAL record, we ensure that "if logical
replication is in progress all existing WAL records are processed
followed by a shutdown.". I think even before this commit, we try to
send the entire WAL before shutdown but not completely sure.

Yes, I think that there is no change in that behavior.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#24Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Dilip Kumar (#23)
RE: Exit walsender before confirming remote flush in logical replication

Dear Dilip, hackers,

Thanks for giving your opinion. I analyzed how this relates to the given commit,
and I think I can keep my patch. What do you think?

# Abstract

* Some modifications are needed.
* We cannot roll back the shutdown if walsenders are stuck
* We don't have a good way to detect a stuck walsender

# Discussion

Logical replication is more likely to get stuck than physical replication.
I think we should reduce that risk as far as possible by fixing the code.
Then, if the fix leads to another failure, we can document a caution for users.

While the server is shutting down, the checkpointer sends the SIGUSR1 signal to walsenders.
This happens after the other processes have exited, so we cannot raise an ERROR and roll back
the operation if the checkpointer notices at that point that a walsender is stuck.

The postmaster has no way to check whether this node is a
publisher. So if we want to add a mechanism that checks the health
of walsenders before shutting down, we must do that at the top of
process_pm_shutdown_request(), even if we are not in logical replication.
I think that would affect the foundations of postgres considerably, and in the first place,
PostgreSQL does not have a mechanism to check the health of a process.

Therefore, I want to adopt the approach in which the walsender itself exits immediately when it receives the signal.

## About patch - Were fixes correct?

With my patch, the walsender calls WalSndDone() in ProcessPendingWrites() when it
has received the SIGUSR1 signal. I think this is how it should be. From the patch [1]:

```
@@ -1450,6 +1450,10 @@ ProcessPendingWrites(void)
                /* Try to flush pending output to the client */
                if (pq_flush_if_writable() != 0)
                        WalSndShutdown();
+
+               /* If we got shut down requested, try to exit the process */
+               if (got_STOPPING)
+                       WalSndDone(XLogSendLogical);
        }

        /* reactivate latch so WalSndLoop knows to continue */
```

Per my analysis, in the case of logical replication, walsenders exit with the following
steps. Note that a logical walsender does not receive the SIGUSR2 signal; it sets the flag
itself instead:

1. postmaster sends a shutdown request to checkpointer
2. checkpointer sends SIGUSR1 to walsenders and waits
3. when walsenders accept SIGUSR1, they turn got_STOPPING on
4. walsenders consume all WAL @ XLogSendLogical
5. walsenders turn got_SIGUSR2 on by themselves @ XLogSendLogical
6. walsenders recognize the flag is on, so call WalSndDone() @ WalSndLoop
7. proc_exit(0)
8. checkpointer writes shutdown record
...

The Type (i) stuck that I reported on -hackers [1] means that processes stop at step 6,
and the Type (ii) stuck means that processes stop at step 4. In step 4, got_SIGUSR2 is never set,
so we must use the got_STOPPING flag.
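
To make the two stuck points easier to see, here is a minimal toy model of steps 3 to 7 above. It is only a sketch under my own assumptions: `ToyWalSender`, `ToyOutcome` and `toy_shutdown()` are hypothetical names, not PostgreSQL code, and the boolean fields merely stand in for the real conditions (got_STOPPING, got_SIGUSR2, the caught-up state, pending output and the remote flush confirmation).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one logical walsender; NOT the real PostgreSQL state. */
typedef struct ToyWalSender
{
	bool		got_stopping;	/* set when the stop request (SIGUSR1) arrives */
	bool		got_sigusr2;	/* set by the sender itself once caught up */
	bool		caught_up;		/* all existing WAL was decoded and queued */
	bool		send_pending;	/* output buffer still holds unsent bytes */
	bool		remote_flushed; /* subscriber confirmed flush of sent data */
} ToyWalSender;

typedef enum ToyOutcome
{
	TOY_EXITED,					/* reached step 7 */
	TOY_STUCK_STEP4,			/* "Type (ii)": cannot finish consuming WAL */
	TOY_STUCK_STEP6				/* "Type (i)": waits for remote flush */
} ToyOutcome;

/* Walk through steps 3..7 once and report where the sender ends up. */
static ToyOutcome
toy_shutdown(ToyWalSender *ws)
{
	assert(ws != NULL);

	ws->got_stopping = true;	/* step 3 */

	if (!ws->caught_up)			/* step 4: blocked while sending WAL */
		return TOY_STUCK_STEP4;

	ws->got_sigusr2 = true;		/* step 5 */

	/* step 6: the unpatched exit condition also needs the remote flush */
	if (ws->send_pending || !ws->remote_flushed)
		return TOY_STUCK_STEP6;

	return TOY_EXITED;			/* step 7 */
}
```

In this model a sender stuck at step 4 never gets got_sigusr2 set, which is why only got_stopping can be tested at that point.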

[1]: /messages/by-id/TYCPR01MB58701A47F35FED0A2B399662F5C49@TYCPR01MB5870.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#25Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#22)
Re: Exit walsender before confirming remote flush in logical replication

On Fri, Jan 20, 2023 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me try to summarize the discussion till now. The problem we are
trying to solve here is to allow a shutdown to complete when walsender
is not able to send the entire WAL. Currently, in such cases, the
shutdown fails. As per our current understanding, this can happen when
(a) walreceiver/walapply process is stuck (not able to receive more
WAL) due to locks or some other reason; (b) a long time delay has been
configured to apply the WAL (we don't yet have such a feature for
logical replication but the discussion for same is in progress).

Both reasons mostly apply to logical replication because there is no
separate walreceiver process whose job is to just flush the WAL. In
logical replication, the process that receives the WAL also applies
it. So, while applying it can stuck for a long time waiting for some
heavy-weight lock to be released by some other long-running
transaction by the backend.

While checking the commits and email discussions in this area, I came
across the email [1] from Michael where something similar seems to
have been discussed. Basically, whether the early shutdown of
walsender can prevent a switchover between publisher and subscriber
but that part was never clearly answered in that email chain. It might
be worth reading the entire discussion [2]. That discussion finally
lead to the following commit:

commit c6c333436491a292d56044ed6e167e2bdee015a2
Author: Andres Freund <andres@anarazel.de>
Date: Mon Jun 5 18:53:41 2017 -0700

Prevent possibility of panics during shutdown checkpoint.

When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and
throws a PANIC if so. At that point, only walsenders are still
active, so one might think this could not happen, but walsenders can
also generate WAL, for instance in BASE_BACKUP and logical decoding
related commands (e.g. via hint bits). So they can trigger this panic
if such a command is run while the shutdown checkpoint is being
written.

To fix this, divide the walsender shutdown into two phases. First,
checkpointer, itself triggered by postmaster, sends a
PROCSIG_WALSND_INIT_STOPPING signal to all walsenders. If the backend
is idle or runs an SQL query this causes the backend to shutdown, if
logical replication is in progress all existing WAL records are
processed followed by a shutdown.
...
...

Here, as mentioned in the commit, we are trying to ensure that before
checkpoint writes its shutdown WAL record, we ensure that "if logical
replication is in progress all existing WAL records are processed
followed by a shutdown.". I think even before this commit, we try to
send the entire WAL before shutdown but not completely sure. There was
no discussion on what happens if the logical walreceiver/walapply
process is waiting on some heavy-weight lock and the network socket
buffer is full due to which walsender is not able to process its WAL.
Is it okay for shutdown to fail in such a case as it is happening now,
or shall we somehow detect that and shut down the walsender, or we
just allow logical walsender to always exit immediately as soon as the
shutdown signal came?

+1 to eliminate condition (b) for logical replication.

Regarding (a), as Amit mentioned before[1], I think we should check if
pq_is_send_pending() is false. Otherwise, we will end up terminating
the WAL stream without the done message. Which will lead to an error
message "ERROR: could not receive data from WAL stream: server closed
the connection unexpectedly" on the subscriber even at a clean
shutdown. In a case where pq_is_send_pending() doesn't become false
for a long time, (e.g., the network socket buffer got full due to the
apply worker waiting on a lock), I think users should unblock it by
themselves. Or it might be practically better to shutdown the
subscriber first in the logical replication case, unlike the physical
replication case. I've not studied the time-delayed logical
replication patch yet, though.
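
As an illustration of why pq_is_send_pending() can stay true for a long time, here is a small toy model of an output buffer in front of a socket whose peer has stopped reading. It is a sketch only: `ToyOutBuf` and the `toy_*` functions are hypothetical and merely imitate the idea of a non-blocking flush; they are not the libpq backend API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy output path; all names here are made up for illustration. */
typedef struct ToyOutBuf
{
	size_t		pending;		/* bytes queued by the sender, not yet written */
	size_t		socket_free;	/* bytes the socket can accept right now */
} ToyOutBuf;

/* Non-blocking flush: write what fits into the socket, never wait. */
static void
toy_flush_if_writable(ToyOutBuf *b)
{
	assert(b != NULL);

	size_t		n = b->pending < b->socket_free ? b->pending : b->socket_free;

	b->pending -= n;
	b->socket_free -= n;
}

/* True while any queued bytes remain unsent. */
static bool
toy_is_send_pending(const ToyOutBuf *b)
{
	return b->pending > 0;
}

/* The peer reads n bytes, which frees that much socket space. */
static void
toy_peer_reads(ToyOutBuf *b, size_t n)
{
	b->socket_free += n;
}
```

With `pending = 100` and `socket_free = 0` (an apply worker waiting on a lock and not reading), any number of flush attempts leaves the data pending; it drains only after the peer starts reading again.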

Regards,

[1]: /messages/by-id/CAA4eK1+pD654+XnrPugYueh7Oh22EBGTr6dA_fS0+gPiHayG9A@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#25)
Re: Exit walsender before confirming remote flush in logical replication

On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jan 20, 2023 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me try to summarize the discussion till now. The problem we are
trying to solve here is to allow a shutdown to complete when walsender
is not able to send the entire WAL. Currently, in such cases, the
shutdown fails. As per our current understanding, this can happen when
(a) walreceiver/walapply process is stuck (not able to receive more
WAL) due to locks or some other reason; (b) a long time delay has been
configured to apply the WAL (we don't yet have such a feature for
logical replication but the discussion for same is in progress).

Both reasons mostly apply to logical replication because there is no
separate walreceiver process whose job is to just flush the WAL. In
logical replication, the process that receives the WAL also applies
it. So, while applying it can stuck for a long time waiting for some
heavy-weight lock to be released by some other long-running
transaction by the backend.

...
...

+1 to eliminate condition (b) for logical replication.

Regarding (a), as Amit mentioned before[1], I think we should check if
pq_is_send_pending() is false.

Sorry, but your suggestion is not completely clear to me. Do you mean
to say that for logical replication, we shouldn't wait for all the WAL
to be successfully replicated but we should ensure to inform the
subscriber that XLOG streaming is done (by ensuring
pq_is_send_pending() is false and by calling EndCommand, pq_flush())?

Otherwise, we will end up terminating
the WAL stream without the done message. Which will lead to an error
message "ERROR: could not receive data from WAL stream: server closed
the connection unexpectedly" on the subscriber even at a clean
shutdown.

But will that be a problem? As per docs of shutdown [1] ( “Smart” mode
disallows new connections, then waits for all existing clients to
disconnect. If the server is in hot standby, recovery and streaming
replication will be terminated once all clients have disconnected.),
there is no such guarantee. I see that it is required for the
switchover in physical replication to ensure that all the WAL is sent
and replicated but we don't need that for logical replication.

In a case where pq_is_send_pending() doesn't become false
for a long time, (e.g., the network socket buffer got full due to the
apply worker waiting on a lock), I think users should unblock it by
themselves. Or it might be practically better to shutdown the
subscriber first in the logical replication case, unlike the physical
replication case.

Yeah, will users like such a dependency? And what will they gain by doing so?

[1]: https://www.postgresql.org/docs/devel/app-pg-ctl.html

--
With Regards,
Amit Kapila.

#27Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#26)
Re: Exit walsender before confirming remote flush in logical replication

At Wed, 1 Feb 2023 14:58:14 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Otherwise, we will end up terminating
the WAL stream without the done message. Which will lead to an error
message "ERROR: could not receive data from WAL stream: server closed
the connection unexpectedly" on the subscriber even at a clean
shutdown.

But will that be a problem? As per docs of shutdown [1] ( “Smart” mode
disallows new connections, then waits for all existing clients to
disconnect. If the server is in hot standby, recovery and streaming
replication will be terminated once all clients have disconnected.),
there is no such guarantee. I see that it is required for the
switchover in physical replication to ensure that all the WAL is sent
and replicated but we don't need that for logical replication.

+1

Since publisher is not aware of apply-delay (by this patch), as a
matter of fact publisher seems gone before sending EOS in that
case. The error message is correctly describing that situation.

In a case where pq_is_send_pending() doesn't become false
for a long time, (e.g., the network socket buffer got full due to the
apply worker waiting on a lock), I think users should unblock it by
themselves. Or it might be practically better to shutdown the
subscriber first in the logical replication case, unlike the physical
replication case.

Yeah, will users like such a dependency? And what will they gain by doing so?

If PostgreSQL required that kind of special care at shutdown, whenever
trouble occurs, just to keep replication consistent, that would not be
acceptable. The current time-delayed logical replication can be seen
as a kind of intentional, continuous, large network lag in this
respect. And I think consistency is guaranteed even in such cases.

On the other hand, I think almost nobody cares about the
exact progress when facing such trouble, as long as replication
consistency is maintained. IMHO that is also true for the
logical-delayed-replication case.

[1] - https://www.postgresql.org/docs/devel/app-pg-ctl.html

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#28Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#26)
Re: Exit walsender before confirming remote flush in logical replication

On Wed, Feb 1, 2023 at 6:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jan 20, 2023 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Let me try to summarize the discussion till now. The problem we are
trying to solve here is to allow a shutdown to complete when walsender
is not able to send the entire WAL. Currently, in such cases, the
shutdown fails. As per our current understanding, this can happen when
(a) walreceiver/walapply process is stuck (not able to receive more
WAL) due to locks or some other reason; (b) a long time delay has been
configured to apply the WAL (we don't yet have such a feature for
logical replication but the discussion for same is in progress).

Both reasons mostly apply to logical replication because there is no
separate walreceiver process whose job is to just flush the WAL. In
logical replication, the process that receives the WAL also applies
it. So, while applying it can stuck for a long time waiting for some
heavy-weight lock to be released by some other long-running
transaction by the backend.

...
...

+1 to eliminate condition (b) for logical replication.

Regarding (a), as Amit mentioned before[1], I think we should check if
pq_is_send_pending() is false.

Sorry, but your suggestion is not completely clear to me. Do you mean
to say that for logical replication, we shouldn't wait for all the WAL
to be successfully replicated but we should ensure to inform the
subscriber that XLOG streaming is done (by ensuring
pq_is_send_pending() is false and by calling EndCommand, pq_flush())?

Yes.

Otherwise, we will end up terminating
the WAL stream without the done message. Which will lead to an error
message "ERROR: could not receive data from WAL stream: server closed
the connection unexpectedly" on the subscriber even at a clean
shutdown.

But will that be a problem? As per docs of shutdown [1] ( “Smart” mode
disallows new connections, then waits for all existing clients to
disconnect. If the server is in hot standby, recovery and streaming
replication will be terminated once all clients have disconnected.),
there is no such guarantee.

In smart shutdown case, the walsender doesn't exit until it can flush
the done message, no?

I see that it is required for the
switchover in physical replication to ensure that all the WAL is sent
and replicated but we don't need that for logical replication.

It won't be a problem in practice in terms of logical replication. But
I'm concerned that this error could confuse users. Is there any case
where the client gets such an error at the smart shutdown?

In a case where pq_is_send_pending() doesn't become false
for a long time, (e.g., the network socket buffer got full due to the
apply worker waiting on a lock), I think users should unblock it by
themselves. Or it might be practically better to shutdown the
subscriber first in the logical replication case, unlike the physical
replication case.

Yeah, will users like such a dependency? And what will they gain by doing so?

IIUC there is no difference between smart shutdown and fast shutdown
in logical replication walsender, but reading the doc[1], it seems to
me that in the smart shutdown mode, the server stops existing sessions
normally. For example, If the client is psql that gets stuck for some
reason and the network buffer gets full, the smart shutdown waits for
a backend process to send all results to the client. I think the
logical replication walsender should follow this behavior for
consistency. One idea is to distinguish smart shutdown and fast
shutdown also in logical replication walsender so that we disconnect
even without the done message in fast shutdown mode, but I'm not sure
it's worthwhile.

Regards,

[1]: https://www.postgresql.org/docs/devel/server-shutdown.html

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#28)
Re: Exit walsender before confirming remote flush in logical replication

On Thu, Feb 2, 2023 at 10:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Feb 1, 2023 at 6:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

In a case where pq_is_send_pending() doesn't become false
for a long time, (e.g., the network socket buffer got full due to the
apply worker waiting on a lock), I think users should unblock it by
themselves. Or it might be practically better to shutdown the
subscriber first in the logical replication case, unlike the physical
replication case.

Yeah, will users like such a dependency? And what will they gain by doing so?

IIUC there is no difference between smart shutdown and fast shutdown
in logical replication walsender, but reading the doc[1], it seems to
me that in the smart shutdown mode, the server stops existing sessions
normally. For example, If the client is psql that gets stuck for some
reason and the network buffer gets full, the smart shutdown waits for
a backend process to send all results to the client. I think the
logical replication walsender should follow this behavior for
consistency. One idea is to distinguish smart shutdown and fast
shutdown also in logical replication walsender so that we disconnect
even without the done message in fast shutdown mode, but I'm not sure
it's worthwhile.

The main problem we want to solve here is to avoid shutdown failing in
case walreceiver/applyworker is busy waiting for some lock or for some
other reason as shown in the email [1]. I haven't tested it but if
such a problem doesn't exist in smart shutdown mode then probably we
can allow walsender to wait till all the data is sent. We can
investigate what it takes to make the logical walsender aware of the
shutdown mode. OTOH, the docs for smart shutdown say "If the
server is in hot standby, recovery and streaming replication will be
terminated once all clients have disconnected." which to me indicates
that it is okay to terminate logical replication connections even in
smart mode.

[1]: /messages/by-id/TYAPR01MB58669CB06F6657ABCEFE6555F5F29@TYAPR01MB5866.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.

#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#27)
Re: Exit walsender before confirming remote flush in logical replication

On Thu, Feb 2, 2023 at 10:04 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Wed, 1 Feb 2023 14:58:14 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Otherwise, we will end up terminating
the WAL stream without the done message. Which will lead to an error
message "ERROR: could not receive data from WAL stream: server closed
the connection unexpectedly" on the subscriber even at a clean
shutdown.

But will that be a problem? As per docs of shutdown [1] ( “Smart” mode
disallows new connections, then waits for all existing clients to
disconnect. If the server is in hot standby, recovery and streaming
replication will be terminated once all clients have disconnected.),
there is no such guarantee. I see that it is required for the
switchover in physical replication to ensure that all the WAL is sent
and replicated but we don't need that for logical replication.

+1

Since publisher is not aware of apply-delay (by this patch), as a
matter of fact publisher seems gone before sending EOS in that
case. The error message is correctly describing that situation.

This can happen even without apply-delay patch. For example, when
apply process is waiting on some lock.

--
With Regards,
Amit Kapila.

#31Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#29)
2 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit, Sawada-san,

IIUC there is no difference between smart shutdown and fast shutdown
in logical replication walsender, but reading the doc[1], it seems to
me that in the smart shutdown mode, the server stops existing sessions
normally. For example, If the client is psql that gets stuck for some
reason and the network buffer gets full, the smart shutdown waits for
a backend process to send all results to the client. I think the
logical replication walsender should follow this behavior for
consistency. One idea is to distinguish smart shutdown and fast
shutdown also in logical replication walsender so that we disconnect
even without the done message in fast shutdown mode, but I'm not sure
it's worthwhile.

The main problem we want to solve here is to avoid shutdown failing in
case walreceiver/applyworker is busy waiting for some lock or for some
other reason as shown in the email [1]. I haven't tested it but if
such a problem doesn't exist in smart shutdown mode then probably we
can allow walsender to wait till all the data is sent.

Based on that idea, I made a PoC patch that introduces smart shutdown for walsenders.
Please see the attached 0002 patch; 0001 is unchanged from v5.
When a logical walsender gets a shutdown request but its send buffer is full due to
the delay, it will:

* wait until the pending data has been sent to the subscriber, if we are in smart shutdown mode
* exit immediately, if we are in fast shutdown mode

Note that in both cases the walsender does not wait for the remote flush of WAL.

To implement this, I added a new attribute to WalSndCtlData that indicates the
shutdown status. It is normally zero, and the postmaster changes it when
it gets a shutdown request.
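
The exit rules described above can be sketched as two standalone predicates. This is only an illustration, assuming the conditions in the 0002 patch macros; `ToyShutdownMode`, `toy_logical_may_exit()` and `toy_physical_may_exit()` are hypothetical names, and the parameters stand in for WalSndCaughtUp, pq_is_send_pending(), sentPtr and replicatedPtr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy shutdown state; zero means no shutdown has been requested. */
typedef enum ToyShutdownMode
{
	TOY_NO_SHUTDOWN = 0,
	TOY_SMART_SHUTDOWN,
	TOY_FAST_SHUTDOWN
} ToyShutdownMode;

/* Logical walsender: never waits for the remote flush. */
static bool
toy_logical_may_exit(ToyShutdownMode mode, bool caught_up, bool send_pending)
{
	assert(mode <= TOY_FAST_SHUTDOWN);

	if (!caught_up)
		return false;

	switch (mode)
	{
		case TOY_FAST_SHUTDOWN:
			return true;			/* exit even with a full send buffer */
		case TOY_SMART_SHUTDOWN:
			return !send_pending;	/* wait until the buffer has drained */
		default:
			return false;			/* no shutdown requested */
	}
}

/* Physical walsender: keeps the existing clean-switchover condition. */
static bool
toy_physical_may_exit(bool caught_up, bool send_pending,
					  uint64_t sent_ptr, uint64_t replicated_ptr)
{
	return caught_up && sent_ptr == replicated_ptr && !send_pending;
}
```

For example, `toy_logical_may_exit(TOY_FAST_SHUTDOWN, true, true)` is true while the smart-mode call with the same arguments is false, which is the only behavioural difference between the two modes in this sketch.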

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v6-0001-Exit-walsender-before-confirming-remote-flush-in-.patch (application/octet-stream)
From 8e926809636f1be474f26cec3d8b3dce80a17a6d Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Thu, 22 Dec 2022 02:49:48 +0000
Subject: [PATCH v6 1/2] Exit walsender before confirming remote flush in
 logical replication

Currently, at shutdown, walsender processes wait to send all pending data and
ensure the all data is flushed in remote node. This mechanism was added by
985bd7 for supporting clean switch over, but such use-case cannot be supported
for logical replication. This commit remove the blocking in the case.

Author: Hayato Kuroda
---
 doc/src/sgml/logical-replication.sgml | 10 ++++++
 src/backend/replication/walsender.c   | 50 ++++++++++++++++++---------
 2 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 6bd5f61e2b..ccddb8a35a 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -1701,6 +1701,16 @@ CONTEXT:  processing remote data for replication origin "pg_16395" during "INSER
    being synchronized. Moreover, if the streaming transaction is applied in
    parallel, there may be additional parallel apply workers.
   </para>
+
+  <caution>
+   <para>
+    Unlike physical replication, data synchronization by logical replication is
+    more likely to be suspended. It is because workers sometimes wait for
+    acquiring locks and they do not consume messages from the publisher. It
+    will be resolved automatically when workers acquire locks and start
+    consuming arrivals.
+   </para>
+  </caution>
  </sect1>
 
  <sect1 id="logical-replication-security">
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4ed3747e3f..25a052adfc 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1450,6 +1450,10 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		/* If we got shut down requested, try to exit the process */
+		if (got_STOPPING)
+			WalSndDone(XLogSendLogical);
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
@@ -2513,18 +2517,14 @@ WalSndLoop(WalSndSendDataCallback send_data)
 										 application_name)));
 				WalSndSetState(WALSNDSTATE_STREAMING);
 			}
-
-			/*
-			 * When SIGUSR2 arrives, we send any outstanding logs up to the
-			 * shutdown checkpoint record (i.e., the latest record), wait for
-			 * them to be replicated to the standby, and exit. This may be a
-			 * normal termination at shutdown, or a promotion, the walsender
-			 * is not sure which.
-			 */
-			if (got_SIGUSR2)
-				WalSndDone(send_data);
 		}
 
+		/*
+		 * When SIGUSR2 arrives, try to exit the process.
+		 */
+		if (got_SIGUSR2)
+			WalSndDone(send_data);
+
 		/* Check for replication timeout. */
 		WalSndCheckTimeOut();
 
@@ -3094,13 +3094,14 @@ XLogSendLogical(void)
 }
 
 /*
- * Shutdown if the sender is caught up.
+ * Shutdown if the sender is we are in a convenient time.
  *
  * NB: This should only be called when the shutdown signal has been received
  * from postmaster.
  *
- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * physical replication mode and all WALs are not yet replicated, this function
+ * will return control to the caller.
  */
 static void
 WalSndDone(WalSndSendDataCallback send_data)
@@ -3118,15 +3119,32 @@ WalSndDone(WalSndSendDataCallback send_data)
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	/*
+	 * Exit if we are in the convenient time.
+	 *
+	 * When we are logical replication mode, we don't have to wait that all
+	 * sent data to be flushed on the subscriber because we cannot support
+	 * clean switchover for it.
+	 */
+	if (WalSndCaughtUp &&
+		(send_data == XLogSendLogical ||
+		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
 	{
 		QueryCompletion qc;
 
 		/* Inform the standby that XLOG streaming is done */
 		SetQueryCompletion(&qc, CMDTAG_COPY, 0);
 		EndCommand(&qc, DestRemote, false);
-		pq_flush();
+
+		/*
+		 * Flush pending data if writable.
+		 *
+		 * Note that the output buffer may be full in case of logical
+		 * replication. If pq_flush() is called at that time, the walsender
+		 * process will be stuck. Therefore, call pq_flush_if_writable()
+		 * instead.
+		 */
+		pq_flush_if_writable();
 
 		proc_exit(0);
 	}
-- 
2.27.0

v6-0002-Introduce-smart-shutdown-for-logical-walsender.patch (application/octet-stream)
From 74fc0770c68cbf5605264907ca8cd328f675f9ec Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Fri, 3 Feb 2023 11:17:26 +0000
Subject: [PATCH v6 2/2] Introduce smart shutdown for logical walsender

---
 src/backend/postmaster/postmaster.c         |  4 +++
 src/backend/replication/walsender.c         | 37 ++++++++++++++++++---
 src/include/replication/walsender.h         |  9 +++++
 src/include/replication/walsender_private.h |  7 ++++
 4 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index f92dbc2270..8a7d1d4efa 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3813,6 +3813,10 @@ PostmasterStateMachine(void)
 				if (CheckpointerPID != 0)
 				{
 					signal_child(CheckpointerPID, SIGUSR2);
+					if (Shutdown >= FastShutdown)
+						WalSndChangeShutdownState(WALSNDSHUTDOWNSTATE_FASTSHUTDOWN);
+					else
+						WalSndChangeShutdownState(WALSNDSHUTDOWNSTATE_SMARTSHUTDOWN);
 					pmState = PM_SHUTDOWN;
 				}
 				else
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 25a052adfc..820d74c5a7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -260,6 +260,16 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
 							  TimeLineID *tli_p);
 
+#define SHUTDOWN_CONDTION_FOR_LOGICAL(send_data)					\
+	((send_data) == XLogSendLogical &&								\
+	 (WalSndCtl->Shutdown == WALSNDSHUTDOWNSTATE_FASTSHUTDOWN ||	\
+	  (WalSndCtl->Shutdown == WALSNDSHUTDOWNSTATE_SMARTSHUTDOWN &&	\
+	   !pq_is_send_pending())))
+
+#define SHUTDOWN_CONDTION_FOR_PHYSICAL(send_data)		\
+	((send_data) == XLogSendPhysical &&					\
+	 sentPtr == replicatedPtr && !pq_is_send_pending())
+
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -3122,13 +3132,18 @@ WalSndDone(WalSndSendDataCallback send_data)
 	/*
 	 * Exit if we are in the convenient time.
 	 *
-	 * When we are logical replication mode, we don't have to wait that all
-	 * sent data to be flushed on the subscriber because we cannot support
-	 * clean switchover for it.
+	 * When we are logical replication mode, the condition for shutdown is
+	 * changed based on the shutdown mode.
+	 *
+	 * For smart shutdown mode, we confirm that there is no pending data
+	 * in the output buffer.
+	 *
+	 * For fast shutdown mode, we don't have to wait that all sent data to be
+	 * flushed on the subscriber.
 	 */
 	if (WalSndCaughtUp &&
-		(send_data == XLogSendLogical ||
-		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
+		(SHUTDOWN_CONDTION_FOR_LOGICAL(send_data) ||
+		 SHUTDOWN_CONDTION_FOR_PHYSICAL(send_data)))
 	{
 		QueryCompletion qc;
 
@@ -3867,3 +3882,15 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+/*
+ * Change the indicator for shutdown mode.
+ *
+ * This should be called by postmaster
+ */
+void
+WalSndChangeShutdownState(WalSndShutdownState state)
+{
+	Assert(!IsUnderPostmaster);
+	WalSndCtl->Shutdown = state;
+}
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 52bb3e2aae..97a31e1613 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -24,6 +24,13 @@ typedef enum
 	CRS_USE_SNAPSHOT
 } CRSSnapshotAction;
 
+typedef enum WalSndShutdownState
+{
+	WALSNDSHUTDOWNSTATE_NOSHUTDOWN = 0,
+	WALSNDSHUTDOWNSTATE_SMARTSHUTDOWN,
+	WALSNDSHUTDOWNSTATE_FASTSHUTDOWN,
+} WalSndShutdownState;
+
 /* global state */
 extern PGDLLIMPORT bool am_walsender;
 extern PGDLLIMPORT bool am_cascading_walsender;
@@ -48,6 +55,8 @@ extern void WalSndWaitStopping(void);
 extern void HandleWalSndInitStopping(void);
 extern void WalSndRqstFileReload(void);
 
+extern void WalSndChangeShutdownState(WalSndShutdownState state);
+
 /*
  * Remember that we want to wakeup walsenders later
  *
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 5310e054c4..6171c16cb2 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -105,6 +105,13 @@ typedef struct
 	 */
 	bool		sync_standbys_defined;
 
+	/*
+	 * Indicator for shutdown mode. Basically it is set to zero, which means
+	 * that shutdown has not been requested yet. The value will be changed
+	 * only once by postmaster.
+	 */
+	int			Shutdown;
+
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
-- 
2.27.0

#32Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#31)
Re: Exit walsender before confirming remote flush in logical replication

On Fri, Feb 3, 2023 at 5:38 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Amit, Sawada-san,

IIUC there is no difference between smart shutdown and fast shutdown
in logical replication walsender, but reading the doc[1], it seems to
me that in the smart shutdown mode, the server stops existing sessions
normally. For example, If the client is psql that gets stuck for some
reason and the network buffer gets full, the smart shutdown waits for
a backend process to send all results to the client. I think the
logical replication walsender should follow this behavior for
consistency. One idea is to distinguish smart shutdown and fast
shutdown also in logical replication walsender so that we disconnect
even without the done message in fast shutdown mode, but I'm not sure
it's worthwhile.

The main problem we want to solve here is to avoid shutdown failing in
case walreceiver/applyworker is busy waiting for some lock or for some
other reason as shown in the email [1].

For this problem, wouldn't using -t (timeout) avoid it? So, if there is
pending WAL, users can always use the -t option to allow the shutdown to
complete. Now, I agree that it is not very clear how much time to
specify, but a user has some option to allow the shutdown to complete.
I am not saying that teaching walsenders about shutdown modes is
a completely bad idea, but it doesn't seem necessary to allow shutdowns
to complete.

--
With Regards,
Amit Kapila.

#33Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#29)
Re: Exit walsender before confirming remote flush in logical replication

Hi,

On 2023-02-02 11:21:54 +0530, Amit Kapila wrote:

The main problem we want to solve here is to avoid shutdown failing in
case walreceiver/applyworker is busy waiting for some lock or for some
other reason as shown in the email [1].

Isn't handling this part of the job of wal_sender_timeout?

I don't at all agree that it's ok to just stop replicating changes
because we're blocked on network IO. The patch justifies this with:

Currently, at shutdown, walsender processes wait to send all pending data and
ensure the all data is flushed in remote node. This mechanism was added by
985bd7 for supporting clean switch over, but such use-case cannot be supported
for logical replication. This commit remove the blocking in the case.

and at the start of the thread with:

In case of logical replication, however, we cannot support the use-case that
switches the role publisher <-> subscriber. Suppose same case as above, additional
transactions are committed while doing step2. To catch up such changes subscriber
must receive WALs related with trans, but it cannot be done because subscriber
cannot request WALs from the specific position. In the case, we must truncate all
data in new subscriber once, and then create new subscription with copy_data
= true.

But that seems a too narrow view to me. Imagine you want to decommission
the current primary, and instead start to use the logical standby as the
primary. For that you'd obviously want to replicate the last few
changes. But with the proposed change, that'd be hard to ever achieve.

Note that even disallowing any writes on the logical primary would make
it hard to be sure that everything is replicated, because autovacuum,
bgwriter, checkpointer all can continue to write WAL. Without being able
to check that the last LSN has indeed been sent out, how do you know
that you didn't miss something?

Greetings,

Andres Freund

#34Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#33)
Re: Exit walsender before confirming remote flush in logical replication

On Sat, Feb 4, 2023 at 6:31 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-02-02 11:21:54 +0530, Amit Kapila wrote:

The main problem we want to solve here is to avoid shutdown failing in
case walreceiver/applyworker is busy waiting for some lock or for some
other reason as shown in the email [1].

Isn't handling this part of the job of wal_sender_timeout?

In some cases, it is not clear whether we can handle it by
wal_sender_timeout. Consider a case of a time-delayed replica where
the applyworker will keep sending some response/alive message so that
walsender doesn't timeout in that (during delay) period. In that case,
because walsender won't timeout, the shutdown will fail (with the
failed message) even though it will be complete after the walsender is
able to send all the WAL and shutdown. The time-delayed replica patch
is still under discussion [1]. Also, for large values of
wal_sender_timeout, it will wait till the walsender times out and can
return with a failed message.
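
To make the timeout argument concrete, here is a minimal sketch of the kind of check being discussed. It is a simplified model, not the walsender code: the function name and the millisecond parameters are hypothetical, and it only assumes that the timeout is measured from the last reply received and that a setting of zero disables it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of a wal_sender_timeout-style check.
 *
 * last_reply_ms: time the last reply/keepalive arrived from the receiver
 * now_ms:        current time
 * timeout_ms:    configured timeout; zero (or less) disables the check
 */
bool
walsender_timed_out(int64_t last_reply_ms, int64_t now_ms, int64_t timeout_ms)
{
	assert(now_ms >= last_reply_ms);

	/* A disabled timeout never fires, however long the receiver is silent. */
	if (timeout_ms <= 0)
		return false;

	/* Fires only if no reply arrived within the timeout window. */
	return now_ms - last_reply_ms >= timeout_ms;
}
```

Under this model, an apply worker that replies every 10 seconds against a 60 second timeout keeps resetting last_reply_ms, so the check never fires during the delay period, which is the situation described above.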

I don't at all agree that it's ok to just stop replicating changes
because we're blocked on network IO. The patch justifies this with:

Currently, at shutdown, walsender processes wait to send all pending data and
ensure the all data is flushed in remote node. This mechanism was added by
985bd7 for supporting clean switch over, but such use-case cannot be supported
for logical replication. This commit remove the blocking in the case.

and at the start of the thread with:

In case of logical replication, however, we cannot support the use-case that
switches the role publisher <-> subscriber. Suppose same case as above, additional
transactions are committed while doing step2. To catch up such changes subscriber
must receive WALs related with trans, but it cannot be done because subscriber
cannot request WALs from the specific position. In the case, we must truncate all
data in new subscriber once, and then create new subscription with copy_data
= true.

But that seems a too narrow view to me. Imagine you want to decommission
the current primary, and instead start to use the logical standby as the
primary. For that you'd obviously want to replicate the last few
changes. But with the proposed change, that'd be hard to ever achieve.

I think that can still be achieved with the idea being discussed which
is to keep allowing sending the WAL for smart shutdown mode but not
for other modes(fast, immediate). I don't know whether it is a good
idea or not but Kuroda-San has produced a POC patch for it. We can
instead choose to improve our docs related to shutdown to explain a
bit more about the shutdown's interaction with (logical and physical)
replication. As of now, it says: (“Smart” mode disallows new
connections, then waits for all existing clients to disconnect. If the
server is in hot standby, recovery and streaming replication will be
terminated once all clients have disconnected.)[2]. Here, it is not
clear that shutdown will wait for sending and flushing all the WALs.
The information for fast and immediate modes is even sparser, which
makes it more difficult to understand what kind of behavior is
expected in those modes.

[1]: https://commitfest.postgresql.org/42/3581/
[2]: https://www.postgresql.org/docs/devel/app-pg-ctl.html

--
With Regards,
Amit Kapila.

#35Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#34)
Re: Exit walsender before confirming remote flush in logical replication

Hi,

On February 5, 2023 8:29:19 PM PST, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Feb 4, 2023 at 6:31 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-02-02 11:21:54 +0530, Amit Kapila wrote:

The main problem we want to solve here is to avoid shutdown failing in
case walreceiver/applyworker is busy waiting for some lock or for some
other reason as shown in the email [1].

Isn't handling this part of the job of wal_sender_timeout?

In some cases, it is not clear whether we can handle it by
wal_sender_timeout. Consider a case of a time-delayed replica where
the applyworker will keep sending some response/alive message so that
walsender doesn't timeout in that (during delay) period. In that case,
because walsender won't timeout, the shutdown will fail (with the
failed message) even though it will be complete after the walsender is
able to send all the WAL and shutdown. The time-delayed replica patch
is still under discussion [1]. Also, for large values of
wal_sender_timeout, it will wait till the walsender times out and can
return with a failed message.

I don't at all agree that it's ok to just stop replicating changes
because we're blocked on network IO. The patch justifies this with:

Currently, at shutdown, walsender processes wait to send all pending data and
ensure the all data is flushed in remote node. This mechanism was added by
985bd7 for supporting clean switch over, but such use-case cannot be supported
for logical replication. This commit remove the blocking in the case.

and at the start of the thread with:

In case of logical replication, however, we cannot support the use-case that
switches the role publisher <-> subscriber. Suppose same case as above, additional
transactions are committed while doing step2. To catch up such changes subscriber
must receive WALs related with trans, but it cannot be done because subscriber
cannot request WALs from the specific position. In the case, we must truncate all
data in new subscriber once, and then create new subscription with copy_data
= true.

But that seems a too narrow view to me. Imagine you want to decommission
the current primary, and instead start to use the logical standby as the
primary. For that you'd obviously want to replicate the last few
changes. But with the proposed change, that'd be hard to ever achieve.

I think that can still be achieved with the idea being discussed which
is to keep allowing sending the WAL for smart shutdown mode but not
for other modes(fast, immediate). I don't know whether it is a good
idea or not but Kuroda-San has produced a POC patch for it. We can
instead choose to improve our docs related to shutdown to explain a
bit more about the shutdown's interaction with (logical and physical)
replication. As of now, it says: (“Smart” mode disallows new
connections, then waits for all existing clients to disconnect. If the
server is in hot standby, recovery and streaming replication will be
terminated once all clients have disconnected.)[2]. Here, it is not
clear that shutdown will wait for sending and flushing all the WALs.
The information for fast and immediate modes is even lesser which
makes it more difficult to understand what kind of behavior is
expected in those modes.

[1] - https://commitfest.postgresql.org/42/3581/
[2] - https://www.postgresql.org/docs/devel/app-pg-ctl.html

Smart shutdown is practically unusable. I don't think it makes sense to tie behavior of walsender to it in any way.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#36Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#35)
Re: Exit walsender before confirming remote flush in logical replication

On Mon, Feb 6, 2023 at 10:33 AM Andres Freund <andres@anarazel.de> wrote:

On February 5, 2023 8:29:19 PM PST, Amit Kapila <amit.kapila16@gmail.com> wrote:

But that seems a too narrow view to me. Imagine you want to decommission
the current primary, and instead start to use the logical standby as the
primary. For that you'd obviously want to replicate the last few
changes. But with the proposed change, that'd be hard to ever achieve.

I think that can still be achieved with the idea being discussed which
is to keep allowing sending the WAL for smart shutdown mode but not
for other modes(fast, immediate). I don't know whether it is a good
idea or not but Kuroda-San has produced a POC patch for it. We can
instead choose to improve our docs related to shutdown to explain a
bit more about the shutdown's interaction with (logical and physical)
replication. As of now, it says: (“Smart” mode disallows new
connections, then waits for all existing clients to disconnect. If the
server is in hot standby, recovery and streaming replication will be
terminated once all clients have disconnected.)[2]. Here, it is not
clear that shutdown will wait for sending and flushing all the WALs.
The information for fast and immediate modes is even lesser which
makes it more difficult to understand what kind of behavior is
expected in those modes.

[1] - https://commitfest.postgresql.org/42/3581/
[2] - https://www.postgresql.org/docs/devel/app-pg-ctl.html

Smart shutdown is practically unusable. I don't think it makes sense to tie behavior of walsender to it in any way.

So, we have the following options: (a) do nothing for this; (b)
clarify the current behavior in docs. Any suggestions?

--
With Regards,
Amit Kapila.

#37Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#36)
Re: Exit walsender before confirming remote flush in logical replication

Hi,

On 2023-02-06 12:23:54 +0530, Amit Kapila wrote:

On Mon, Feb 6, 2023 at 10:33 AM Andres Freund <andres@anarazel.de> wrote:

Smart shutdown is practically unusable. I don't think it makes sense to tie behavior of walsender to it in any way.

So, we have the following options: (a) do nothing for this; (b)
clarify the current behavior in docs. Any suggestions?

b) seems good.

I also think it'd make sense to improve this on a code-level. Just not in the
wholesale way discussed so far.

How about we make it an option in START_REPLICATION? Delayed logical rep can
toggle that on by default.

Greetings,

Andres Freund

#38Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#37)
Re: Exit walsender before confirming remote flush in logical replication

On Tue, Feb 7, 2023 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-02-06 12:23:54 +0530, Amit Kapila wrote:

On Mon, Feb 6, 2023 at 10:33 AM Andres Freund <andres@anarazel.de> wrote:

Smart shutdown is practically unusable. I don't think it makes sense to tie behavior of walsender to it in any way.

So, we have the following options: (a) do nothing for this; (b)
clarify the current behavior in docs. Any suggestions?

b) seems good.

I also think it'd make sense to improve this on a code-level. Just not in the
wholesale way discussed so far.

How about we make it an option in START_REPLICATION? Delayed logical rep can
toggle that on by default.

Works for me. So, when this option is set in START_REPLICATION
message, walsender will set some flag and allow itself to exit at
shutdown without waiting for WAL to be sent?
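
A minimal sketch of the exit check this would imply, modelled on the condition in the v6 patch quoted earlier in the thread. Everything here is a stand-in for illustration: the struct, its field names, and the function are hypothetical, and keeping the caught-up requirement when the flag is set is an assumption, not something decided in this discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr; hypothetical, for illustration only. */
typedef uint64_t XLogRecPtr;

/* Hypothetical model of the walsender state that the exit check looks at. */
typedef struct WalSenderModel
{
	bool		caught_up;				/* all available WAL has been sent */
	bool		send_pending;			/* output buffer still holds data */
	XLogRecPtr	sent_ptr;				/* last position sent */
	XLogRecPtr	replicated_ptr;			/* last position confirmed flushed */
	bool		exit_before_confirming;	/* flag set via START_REPLICATION */
} WalSenderModel;

/* Returns true if the walsender may exit at shutdown. */
bool
walsender_may_exit(const WalSenderModel *ws)
{
	assert(ws != NULL);

	/* Assumption: the caught-up requirement is kept in both modes. */
	if (!ws->caught_up)
		return false;

	/* With the flag set, do not wait for the remote flush confirmation. */
	if (ws->exit_before_confirming)
		return true;

	/* Default: wait until everything sent is confirmed and nothing is queued. */
	return ws->sent_ptr == ws->replicated_ptr && !ws->send_pending;
}
```

With the flag off, the model behaves like the existing condition; with it on, a walsender whose receiver is lagging or blocked can still exit once it has caught up.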

--
With Regards,
Amit Kapila.

#39Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#38)
Re: Exit walsender before confirming remote flush in logical replication

Hi,

On 2023-02-07 09:00:13 +0530, Amit Kapila wrote:

On Tue, Feb 7, 2023 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:

How about we make it an option in START_REPLICATION? Delayed logical rep can
toggle that on by default.

Works for me. So, when this option is set in START_REPLICATION
message, walsender will set some flag and allow itself to exit at
shutdown without waiting for WAL to be sent?

Yes. I think that might be useful in other situations as well, but we don't
need to make those configurable initially. But I imagine it'd be useful to set
things up so that non-HA physical replicas don't delay shutdown, particularly
if they're geographically far away.

Greetings,

Andres Freund

#40Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Andres Freund (#39)
2 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

Dear Andres, Amit,

On 2023-02-07 09:00:13 +0530, Amit Kapila wrote:

On Tue, Feb 7, 2023 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:

How about we make it an option in START_REPLICATION? Delayed logical rep can
toggle that on by default.

Works for me. So, when this option is set in START_REPLICATION
message, walsender will set some flag and allow itself to exit at
shutdown without waiting for WAL to be sent?

Yes. I think that might be useful in other situations as well, but we don't
need to make those configurable initially. But I imagine it'd be useful to set
things up so that non-HA physical replicas don't delay shutdown, particularly
if they're geographically far away.

Based on the discussion, I made a patch that adds a walsender option,
exit_before_confirming, to the START_REPLICATION replication command. It can be
used for both physical and logical replication. I wrote the patch with
extensibility in mind: it allows further options to be added.
Suggestions for better naming are very welcome.

For physical replication, the grammar was changed slightly to resemble the
logical one. It can now accept options, but currently only one option is
allowed, and it is not used in normal streaming replication. For logical
replication, the option is combined with the options for the output plugin.
Of course, we can change the API to a better style.

The 0001 patch was ported from the time-delayed logical replication thread [1];
it uses the added option. When the min_apply_delay option is specified and the
publisher appears to be PG16 or later, the apply worker sends a START_REPLICATION
query with exit_before_confirming = true. The worker will also restart and send
START_REPLICATION again when min_apply_delay is changed from zero to a non-zero
value or from non-zero to zero.
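
The two rules in the paragraph above can be sketched as follows. This is only an illustration, not code from the patch: the function names are hypothetical, and the 160000 threshold assumes the usual server_version_num encoding for PG16.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed server_version_num threshold for "PG16 or later". */
#define ASSUMED_MIN_SERVER_VERSION 160000

/*
 * Hypothetical helper: should the apply worker ask for
 * exit_before_confirming when it sends START_REPLICATION?
 */
bool
should_request_exit_before_confirming(int min_apply_delay_ms,
									  int server_version_num)
{
	assert(min_apply_delay_ms >= 0);

	/* Only delayed subscriptions ask, and only of publishers that know it. */
	return min_apply_delay_ms > 0 &&
		server_version_num >= ASSUMED_MIN_SERVER_VERSION;
}

/*
 * Hypothetical helper: does a change of min_apply_delay require the worker
 * to restart and send START_REPLICATION again?
 */
bool
delay_change_needs_restart(int old_delay_ms, int new_delay_ms)
{
	assert(old_delay_ms >= 0 && new_delay_ms >= 0);

	/* Only a switch between zero and non-zero changes the requested option. */
	return (old_delay_ms == 0) != (new_delay_ms == 0);
}
```

Changing the delay from one non-zero value to another leaves the option unchanged, so in this model no restart is needed for that case.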

Note that I removed the version number because the approach has completely changed.

[1]: /messages/by-id/TYCPR01MB8373BA483A6D2C924C600968EDDB9@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

0001-Time-delayed-logical-replication-subscriber.patch (application/octet-stream)
From 89f37aefd1d20fe1d35fb40e41e4d1c2ac1d0ce7 Mon Sep 17 00:00:00 2001
From: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Date: Tue, 7 Feb 2023 13:05:34 +0000
Subject: [PATCH 1/2] Time-delayed logical replication subscriber

Similar to physical replication, a time-delayed copy of the data for
logical replication is useful for some scenarios (particularly to fix
errors that might cause data loss).

This patch implements a new subscription parameter called 'min_apply_delay'.

If the subscription sets min_apply_delay parameter, the logical
replication worker will delay the transaction apply for min_apply_delay
milliseconds.

The delay is calculated between the WAL time stamp and the current time
on the subscriber.

The delay occurs before we start to apply the transaction on the
subscriber. The main reason is to avoid keeping a transaction open for
a long time. Regular and prepared transactions are covered. Streamed
transactions are also covered.

The combination of parallel streaming mode and min_apply_delay is not
allowed. This is because in parallel streaming mode, we start applying
the transaction stream as soon as the first change arrives without
knowing the transaction's prepare/commit time. This means we cannot
calculate the underlying network/decoding lag between publisher and
subscriber, and so always waiting for the full 'min_apply_delay' period
might include unnecessary delay.

The other possibility was to apply the delay at the end of the parallel
apply transaction but that would cause issues related to resource
bloat and locks being held for a long time.

Note that this feature doesn't interact with skip transaction feature.
The skip transaction feature applies to one transaction with a specific LSN.
So, even if the skipped transaction and non-skipped transaction come
consecutively in a very short time, regardless of the order of which comes
first, the time-delayed feature gets balanced by delayed application
for other transactions before and after the skipped transaction.

Author: Euler Taveira, Takamichi Osumi, Kuroda Hayato
Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Shveta Malik,
             Kyotaro Horiguchi, Shi Yu, Wang Wei, Dilip Kumar, Melih Mutlu
Discussion: https://postgr.es/m/CAB-JLwYOYwL=XTyAXKiH5CtM_Vm8KjKh7aaitCKvmCh4rzr5pQ@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                    |   9 +
 doc/src/sgml/config.sgml                      |  12 ++
 doc/src/sgml/glossary.sgml                    |  14 ++
 doc/src/sgml/logical-replication.sgml         |   6 +
 doc/src/sgml/ref/alter_subscription.sgml      |   5 +-
 doc/src/sgml/ref/create_subscription.sgml     |  49 ++++-
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/catalog/system_views.sql          |   7 +-
 src/backend/commands/subscriptioncmds.c       | 122 +++++++++++-
 .../replication/logical/applyparallelworker.c |   3 +-
 src/backend/replication/logical/worker.c      | 165 ++++++++++++++--
 src/bin/pg_dump/pg_dump.c                     |  15 +-
 src/bin/pg_dump/pg_dump.h                     |   1 +
 src/bin/psql/describe.c                       |   9 +-
 src/bin/psql/tab-complete.c                   |   4 +-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/replication/worker_internal.h     |   2 +-
 src/test/regress/expected/subscription.out    | 181 +++++++++++-------
 src/test/regress/sql/subscription.sql         |  24 +++
 src/test/subscription/t/001_rep_changes.pl    |  30 +++
 20 files changed, 558 insertions(+), 104 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048054..5dc5ca1133 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7873,6 +7873,15 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subminapplydelay</structfield> <type>int4</type>
+      </para>
+      <para>
+       The minimum delay, in milliseconds, for applying changes
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subname</structfield> <type>name</type>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d190be1925..626a8b5bd0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4787,6 +4787,18 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
        the <filename>postgresql.conf</filename> file or on the server
        command line.
       </para>
+      <para>
+       For time-delayed logical replication, the apply worker sends a feedback
+       message to the publisher every
+       <varname>wal_receiver_status_interval</varname> milliseconds. Make sure
+       to set <varname>wal_receiver_status_interval</varname> less than the
+       <varname>wal_sender_timeout</varname> on the publisher, otherwise, the
+       <literal>walsender</literal> will repeatedly terminate due to timeout
+       errors. Note that if <varname>wal_receiver_status_interval</varname> is
+       set to zero, the apply worker sends no feedback messages during the
+       <literal>min_apply_delay</literal> period. Refer to
+       <xref linkend="sql-createsubscription"/> for more information.
+      </para>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 7c01a541fe..6ed6fa5853 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1729,6 +1729,20 @@
    </glossdef>
   </glossentry>
 
+  <glossentry id="glossary-time-delayed-replication">
+   <glossterm>Time-delayed replication</glossterm>
+   <glossdef>
+     <para>
+      Replication setup that applies time-delayed copy of the data.
+    </para>
+    <para>
+     For more information, see
+     <xref linkend="guc-recovery-min-apply-delay"/> for physical replication
+     and <xref linkend="sql-createsubscription"/> for logical replication.
+    </para>
+   </glossdef>
+  </glossentry>
+
   <glossentry id="glossary-toast">
    <glossterm>TOAST</glossterm>
    <glossdef>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 1bd5660c87..6bd5f61e2b 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -247,6 +247,12 @@
    target table.
   </para>
 
+  <para>
+   A subscription can delay the application of changes by specifying the
+   <literal>min_apply_delay</literal> subscription parameter. See
+   <xref linkend="sql-createsubscription"/> for details.
+  </para>
+
   <sect2 id="logical-replication-subscription-slot">
    <title>Replication Slot Management</title>
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 964fcbb8ff..8b7eb28e54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -213,8 +213,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
       <literal>binary</literal>, <literal>streaming</literal>,
-      <literal>disable_on_error</literal>, and
-      <literal>origin</literal>.
+      <literal>disable_on_error</literal>,
+      <literal>origin</literal>, and
+      <literal>min_apply_delay</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 51c45f17c7..1b4b8390af 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -349,7 +349,49 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
-      </variablelist></para>
+
+       <varlistentry>
+        <term><literal>min_apply_delay</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          By default, the subscriber applies changes as soon as possible. This
+          parameter allows the user to delay the application of changes by a
+          given time period. If the value is specified without units, it is
+          taken as milliseconds. The default is zero (no delay). See
+          <xref linkend="config-setting-names-values"/> for details on the
+          available valid time units.
+         </para>
+         <para>
+          Any delay becomes effective only after all initial table
+          synchronization has finished and occurs before each transaction starts
+          to get applied on the subscriber. The delay is calculated as the
+          difference between the WAL timestamp as written on the publisher and
+          the current time on the subscriber. Any overhead of time spent in
+          logical decoding and in transferring the transaction may reduce the
+          actual wait time. It is also possible that the overhead already
+          exceeds the requested <literal>min_apply_delay</literal> value, in
+          which case no delay is applied. If the system clocks on publisher and
+          subscriber are not synchronized, this may lead to apply changes
+          earlier than expected, but this is not a major issue because this
+          parameter is typically much larger than the time deviations between
+          servers. Note that if this parameter is set to a long delay, the
+          replication will stop if the replication slot falls behind the current
+          LSN by more than
+          <link linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</literal></link>.
+         </para>
+         <warning>
+           <para>
+            Delaying the replication means there is a much longer time between
+            making a change on the publisher, and that change being committed
+            on the subscriber. This can impact the performance of synchronous
+            replication. See <xref linkend="guc-synchronous-commit"/>
+            parameter.
+           </para>
+         </warning>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
 
     </listitem>
    </varlistentry>
@@ -420,6 +462,11 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
    published with different column lists are not supported.
   </para>
 
+  <para>
+   A non-zero <literal>min_apply_delay</literal> parameter is not allowed when
+   streaming in parallel mode.
+  </para>
+
   <para>
    We allow non-existent publications to be specified so that users can add
    those later. This means
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index a56ae311c3..e19e5cbca2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->skiplsn = subform->subskiplsn;
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
+	sub->minapplydelay = subform->subminapplydelay;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..317c2010cb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1299,9 +1299,10 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled,
-              subbinary, substream, subtwophasestate, subdisableonerr,
-              subslotname, subsynccommit, subpublications, suborigin)
+GRANT SELECT (oid, subdbid, subskiplsn, subminapplydelay, subname, subowner,
+              subenabled, subbinary, substream, subtwophasestate,
+              subdisableonerr, subslotname, subsynccommit, subpublications,
+              suborigin)
     ON pg_subscription TO public;
 
 CREATE VIEW pg_stat_subscription_stats AS
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 464db6d247..82e16fd0f9 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -66,6 +66,7 @@
 #define SUBOPT_DISABLE_ON_ERR		0x00000400
 #define SUBOPT_LSN					0x00000800
 #define SUBOPT_ORIGIN				0x00001000
+#define SUBOPT_MIN_APPLY_DELAY		0x00002000
 
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
@@ -90,6 +91,7 @@ typedef struct SubOpts
 	bool		disableonerr;
 	char	   *origin;
 	XLogRecPtr	lsn;
+	int32		min_apply_delay;
 } SubOpts;
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
@@ -100,7 +102,7 @@ static void check_publications_origin(WalReceiverConn *wrconn,
 static void check_duplicates_in_publist(List *publist, Datum *datums);
 static List *merge_publications(List *oldpublist, List *newpublist, bool addpub, const char *subname);
 static void ReportSlotConnectionError(List *rstates, Oid subid, char *slotname, char *err);
-
+static int32 defGetMinApplyDelay(DefElem *def);
 
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
@@ -146,6 +148,8 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 		opts->disableonerr = false;
 	if (IsSet(supported_opts, SUBOPT_ORIGIN))
 		opts->origin = pstrdup(LOGICALREP_ORIGIN_ANY);
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY))
+		opts->min_apply_delay = 0;
 
 	/* Parse options */
 	foreach(lc, stmt_options)
@@ -324,6 +328,15 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 			opts->specified_opts |= SUBOPT_LSN;
 			opts->lsn = lsn;
 		}
+		else if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+				 strcmp(defel->defname, "min_apply_delay") == 0)
+		{
+			if (IsSet(opts->specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				errorConflictingDefElem(defel, pstate);
+
+			opts->specified_opts |= SUBOPT_MIN_APPLY_DELAY;
+			opts->min_apply_delay = defGetMinApplyDelay(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -404,6 +417,32 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 								"slot_name = NONE", "create_slot = false")));
 		}
 	}
+
+	/*
+	 * The combination of parallel streaming mode and min_apply_delay is not
+	 * allowed. This is because in parallel streaming mode, we start applying
+	 * the transaction stream as soon as the first change arrives without
+	 * knowing the transaction's prepare/commit time. This means we cannot
+	 * calculate the underlying network/decoding lag between publisher and
+	 * subscriber, and so always waiting for the full 'min_apply_delay' period
+	 * might include unnecessary delay.
+	 *
+	 * The other possibility was to apply the delay at the end of the parallel
+	 * apply transaction but that would cause issues related to resource bloat
+	 * and locks being held for a long time.
+	 */
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+		opts->min_apply_delay > 0 &&
+		opts->streaming == LOGICALREP_STREAM_PARALLEL)
+		ereport(ERROR,
+				errcode(ERRCODE_SYNTAX_ERROR),
+
+		/*
+		 * translator: the first %s is a string of the form "parameter > 0"
+		 * and the second one is "option = value".
+		 */
+				errmsg("%s and %s are mutually exclusive options",
+					   "min_apply_delay > 0", "streaming = parallel"));
 }
 
 /*
@@ -560,7 +599,8 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 					  SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 					  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 					  SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT |
-					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN);
+					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN |
+					  SUBOPT_MIN_APPLY_DELAY);
 	parse_subscription_options(pstate, stmt->options, supported_opts, &opts);
 
 	/*
@@ -625,6 +665,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 	values[Anum_pg_subscription_oid - 1] = ObjectIdGetDatum(subid);
 	values[Anum_pg_subscription_subdbid - 1] = ObjectIdGetDatum(MyDatabaseId);
 	values[Anum_pg_subscription_subskiplsn - 1] = LSNGetDatum(InvalidXLogRecPtr);
+	values[Anum_pg_subscription_subminapplydelay - 1] = Int32GetDatum(opts.min_apply_delay);
 	values[Anum_pg_subscription_subname - 1] =
 		DirectFunctionCall1(namein, CStringGetDatum(stmt->subname));
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
@@ -1054,7 +1095,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 				supported_opts = (SUBOPT_SLOT_NAME |
 								  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 								  SUBOPT_STREAMING | SUBOPT_DISABLE_ON_ERR |
-								  SUBOPT_ORIGIN);
+								  SUBOPT_ORIGIN | SUBOPT_MIN_APPLY_DELAY);
 
 				parse_subscription_options(pstate, stmt->options,
 										   supported_opts, &opts);
@@ -1098,6 +1139,19 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.streaming == LOGICALREP_STREAM_PARALLEL &&
+						!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)
+						&& sub->minapplydelay > 0)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set parallel streaming mode for subscription with %s",
+									   "min_apply_delay"));
+
 					values[Anum_pg_subscription_substream - 1] =
 						CharGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -1111,6 +1165,26 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 						= true;
 				}
 
+				if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.min_apply_delay > 0 &&
+						!IsSet(opts.specified_opts, SUBOPT_STREAMING)
+						&& sub->stream == LOGICALREP_STREAM_PARALLEL)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set %s for subscription in parallel streaming mode",
+									   "min_apply_delay"));
+
+					values[Anum_pg_subscription_subminapplydelay - 1] =
+						Int32GetDatum(opts.min_apply_delay);
+					replaces[Anum_pg_subscription_subminapplydelay - 1] = true;
+				}
+
 				if (IsSet(opts.specified_opts, SUBOPT_ORIGIN))
 				{
 					values[Anum_pg_subscription_suborigin - 1] =
@@ -2195,3 +2269,45 @@ defGetStreamingMode(DefElem *def)
 					def->defname)));
 	return LOGICALREP_STREAM_OFF;	/* keep compiler quiet */
 }
+
+/*
+ * Extract the min_apply_delay value from a DefElem. This is very similar to
+ * parse_and_validate_value() for integer values, because min_apply_delay
+ * accepts the same parameter format as recovery_min_apply_delay.
+ */
+static int32
+defGetMinApplyDelay(DefElem *def)
+{
+	char	   *input_string;
+	int			result;
+	const char *hintmsg;
+
+	input_string = defGetString(def);
+
+	/*
+	 * Parse the given string as a parameter with millisecond units.
+	 */
+	if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid value for parameter \"%s\": \"%s\"",
+						"min_apply_delay", input_string),
+				 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+
+	/*
+	 * Check the lower boundary of the valid min_apply_delay range, and also
+	 * the upper boundary as a safeguard for platforms where int is wider
+	 * than int32. Although parse_int() has confirmed that the result is less
+	 * than or equal to INT_MAX, the value will be stored in a catalog column
+	 * of type int32.
+	 */
+	if (result < 0 || result > PG_INT32_MAX)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)",
+						result,
+						"min_apply_delay",
+						0, PG_INT32_MAX)));
+
+	return result;
+}
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index da437e0bc3..32db20fd98 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -704,7 +704,8 @@ pa_process_spooled_messages_if_required(void)
 	{
 		apply_spooled_messages(&MyParallelShared->fileset,
 							   MyParallelShared->xid,
-							   InvalidXLogRecPtr);
+							   InvalidXLogRecPtr,
+							   0);
 		pa_set_fileset_state(MyParallelShared, FS_EMPTY);
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfb2ab6248..c574531040 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -319,6 +319,17 @@ static List *on_commit_wakeup_workers_subids = NIL;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/*
+ * To avoid a walsender timeout during time-delayed logical replication, the
+ * apply worker keeps sending feedback messages during the delay period.
+ * Meanwhile, the feature delays the apply before the start of the
+ * transaction, and thus we don't write WAL records for the suspended changes
+ * during the wait. When the apply worker sends a feedback message during the
+ * delay, we should not overwrite the flushed and applied LSN positions with
+ * the latest received LSN. See send_feedback() for details.
+ */
+static XLogRecPtr last_received = InvalidXLogRecPtr;
+
 /* fields valid only when processing streamed transaction */
 static bool in_streamed_transaction = false;
 
@@ -389,7 +400,8 @@ static void stream_write_change(char action, StringInfo s);
 static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
 static void stream_close_file(void);
 
-static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
+static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply,
+						  bool has_unprocessed_change);
 
 static void DisableSubscriptionAndExit(void);
 
@@ -999,6 +1011,109 @@ slot_modify_data(TupleTableSlot *slot, TupleTableSlot *srcslot,
 	ExecStoreVirtualTuple(slot);
 }
 
+/*
+ * When min_apply_delay parameter is set on the subscriber, we wait long enough
+ * to make sure a transaction is applied at least that period behind the
+ * publisher.
+ *
+ * While physical replication applies the delay at commit time, this
+ * feature applies the delay to the next transaction, before starting that
+ * transaction. This is mainly because keeping a transaction that has
+ * performed write operations open for a long time causes issues such as
+ * bloat and long-held locks.
+ *
+ * The min_apply_delay parameter will take effect only after all tables are in
+ * READY state.
+ *
+ * xid is the transaction id where we apply the delay.
+ *
+ * finish_ts is the commit/prepare time of both regular (non-streamed) and
+ * streamed transactions. Unlike the regular (non-streamed) cases, the delay
+ * is applied in a STREAM COMMIT/STREAM PREPARE message for streamed
+ * transactions. The STREAM START message does not contain a commit/prepare
+ * time (it will be available when the in-progress transaction finishes).
+ * Hence, it's not appropriate to apply a delay at the STREAM START time.
+ */
+static void
+maybe_apply_delay(TransactionId xid, TimestampTz finish_ts)
+{
+	Assert(finish_ts > 0);
+
+	/* Nothing to do if no delay set */
+	if (!MySubscription->minapplydelay)
+		return;
+
+	/*
+	 * The min_apply_delay parameter is ignored until all tablesync workers
+	 * have reached READY state. This is because if we allowed the delay
+	 * during the catchup phase, then once we reached the limit of tablesync
+	 * workers it would impose a delay for each subsequent worker. That would
+	 * cause initial table synchronization completion to take a long time.
+	 */
+	if (!AllTablesyncsReady())
+		return;
+
+	/* Apply the delay by the latch mechanism */
+	while (true)
+	{
+		TimestampTz delayUntil;
+		long		diffms;
+
+		ResetLatch(MyLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* This might change wal_receiver_status_interval */
+		if (ConfigReloadPending)
+		{
+			ConfigReloadPending = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		/*
+		 * Before calculating the time duration, reload the catalog if needed.
+		 */
+		if (!in_remote_transaction && !in_streamed_transaction)
+		{
+			AcceptInvalidationMessages();
+			maybe_reread_subscription();
+		}
+
+		delayUntil = TimestampTzPlusMilliseconds(finish_ts, MySubscription->minapplydelay);
+		diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
+
+		/*
+		 * Exit without arming the latch if it's already past time to apply
+		 * this transaction.
+		 */
+		if (diffms <= 0)
+			break;
+
+		elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = %d ms, remaining wait time: %ld ms",
+			 xid, MySubscription->minapplydelay, diffms);
+
+		/*
+		 * Call send_feedback() to prevent the publisher from timing out
+		 * during the delay, provided wal_receiver_status_interval is
+		 * enabled (greater than zero).
+		 */
+		if (wal_receiver_status_interval > 0 &&
+			diffms > wal_receiver_status_interval * 1000L)
+		{
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  wal_receiver_status_interval * 1000L,
+					  WAIT_EVENT_RECOVERY_APPLY_DELAY);
+			send_feedback(last_received, true, false, true);
+		}
+		else
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  diffms,
+					  WAIT_EVENT_RECOVERY_APPLY_DELAY);
+	}
+}
+
 /*
  * Handle BEGIN message.
  */
@@ -1013,6 +1128,9 @@ apply_handle_begin(StringInfo s)
 	logicalrep_read_begin(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
 
+	/* Should we delay the current transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.committime);
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	maybe_start_skipping_changes(begin_data.final_lsn);
@@ -1070,6 +1188,9 @@ apply_handle_begin_prepare(StringInfo s)
 	logicalrep_read_begin_prepare(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.prepare_lsn);
 
+	/* Should we delay the current prepared transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.prepare_time);
+
 	remote_final_lsn = begin_data.prepare_lsn;
 
 	maybe_start_skipping_changes(begin_data.prepare_lsn);
@@ -1317,7 +1438,8 @@ apply_handle_stream_prepare(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset,
-								   prepare_data.xid, prepare_data.prepare_lsn);
+								   prepare_data.xid, prepare_data.prepare_lsn,
+								   prepare_data.prepare_time);
 
 			/* Mark the transaction as prepared. */
 			apply_handle_prepare_internal(&prepare_data);
@@ -2011,10 +2133,13 @@ ensure_last_message(FileSet *stream_fileset, TransactionId xid, int fileno,
 
 /*
  * Common spoolfile processing.
+ *
+ * The commit/prepare time (finish_ts) is required for time-delayed logical
+ * replication.
  */
 void
 apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-					   XLogRecPtr lsn)
+					   XLogRecPtr lsn, TimestampTz finish_ts)
 {
 	StringInfoData s2;
 	int			nchanges;
@@ -2025,6 +2150,10 @@ apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
 	int			fileno;
 	off_t		offset;
 
+	/* Should we delay the current transaction? */
+	if (finish_ts)
+		maybe_apply_delay(xid, finish_ts);
+
 	if (!am_parallel_apply_worker())
 		maybe_start_skipping_changes(lsn);
 
@@ -2174,7 +2303,7 @@ apply_handle_stream_commit(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset, xid,
-								   commit_data.commit_lsn);
+								   commit_data.commit_lsn, commit_data.committime);
 
 			apply_handle_commit_internal(&commit_data);
 
@@ -3447,7 +3576,7 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
  * Apply main loop.
  */
 static void
-LogicalRepApplyLoop(XLogRecPtr last_received)
+LogicalRepApplyLoop(void)
 {
 	TimestampTz last_recv_timestamp = GetCurrentTimestamp();
 	bool		ping_sent = false;
@@ -3568,7 +3697,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						if (last_received < end_lsn)
 							last_received = end_lsn;
 
-						send_feedback(last_received, reply_requested, false);
+						send_feedback(last_received, reply_requested, false, false);
 						UpdateWorkerStats(last_received, timestamp, true);
 					}
 					/* other message types are purposefully ignored */
@@ -3581,7 +3710,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		}
 
 		/* confirm all writes so far */
-		send_feedback(last_received, false, false);
+		send_feedback(last_received, false, false, false);
 
 		if (!in_remote_transaction && !in_streamed_transaction)
 		{
@@ -3678,7 +3807,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 				}
 			}
 
-			send_feedback(last_received, requestReply, requestReply);
+			send_feedback(last_received, requestReply, requestReply, false);
 
 			/*
 			 * Force reporting to ensure long idle periods don't lead to
@@ -3708,7 +3837,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
  * to send a response to avoid timeouts.
  */
 static void
-send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
+send_feedback(XLogRecPtr recvpos, bool force, bool requestReply, bool has_unprocessed_change)
 {
 	static StringInfo reply_message = NULL;
 	static TimestampTz send_time = 0;
@@ -3738,8 +3867,14 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	/*
 	 * No outstanding transactions to flush, we can report the latest received
 	 * position. This is important for synchronous replication.
+	 *
+	 * If the logical replication subscription has unprocessed changes, do
+	 * not inform the publisher that the latest received LSN is already
+	 * applied and flushed; otherwise, the publisher will make a wrong
+	 * assumption about the logical replication progress. Instead, just send a
+	 * feedback message to avoid a replication timeout during the delay.
 	 */
-	if (!have_pending_txes)
+	if (!have_pending_txes && !has_unprocessed_change)
 		flushpos = writepos = recvpos;
 
 	if (writepos < last_writepos)
@@ -3776,8 +3911,9 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
 
-	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
+	elog(DEBUG2, "sending feedback (force %d, has_unprocessed_change %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
+		 has_unprocessed_change,
 		 LSN_FORMAT_ARGS(recvpos),
 		 LSN_FORMAT_ARGS(writepos),
 		 LSN_FORMAT_ARGS(flushpos));
@@ -4367,11 +4503,11 @@ start_table_sync(XLogRecPtr *origin_startpos, char **myslotname)
  * of system resource error and are not repeatable.
  */
 static void
-start_apply(XLogRecPtr origin_startpos)
+start_apply(void)
 {
 	PG_TRY();
 	{
-		LogicalRepApplyLoop(origin_startpos);
+		LogicalRepApplyLoop();
 	}
 	PG_CATCH();
 	{
@@ -4661,7 +4797,8 @@ ApplyWorkerMain(Datum main_arg)
 	}
 
 	/* Run the main loop. */
-	start_apply(origin_startpos);
+	last_received = origin_startpos;
+	start_apply();
 
 	proc_exit(0);
 }
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 527c7651ab..1e87f0124e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4494,6 +4494,7 @@ getSubscriptions(Archive *fout)
 	int			i_subsynccommit;
 	int			i_subpublications;
 	int			i_subbinary;
+	int			i_subminapplydelay;
 	int			i,
 				ntups;
 
@@ -4546,9 +4547,13 @@ getSubscriptions(Archive *fout)
 						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	if (fout->remoteVersion >= 160000)
-		appendPQExpBufferStr(query, " s.suborigin\n");
+		appendPQExpBufferStr(query,
+							 " s.suborigin,\n"
+							 " s.subminapplydelay\n");
 	else
-		appendPQExpBuffer(query, " '%s' AS suborigin\n", LOGICALREP_ORIGIN_ANY);
+		appendPQExpBuffer(query, " '%s' AS suborigin,\n"
+						  " 0 AS subminapplydelay\n",
+						  LOGICALREP_ORIGIN_ANY);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4576,6 +4581,7 @@ getSubscriptions(Archive *fout)
 	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 	i_subdisableonerr = PQfnumber(res, "subdisableonerr");
 	i_suborigin = PQfnumber(res, "suborigin");
+	i_subminapplydelay = PQfnumber(res, "subminapplydelay");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4606,6 +4612,8 @@ getSubscriptions(Archive *fout)
 		subinfo[i].subdisableonerr =
 			pg_strdup(PQgetvalue(res, i, i_subdisableonerr));
 		subinfo[i].suborigin = pg_strdup(PQgetvalue(res, i, i_suborigin));
+		subinfo[i].subminapplydelay =
+			atoi(PQgetvalue(res, i, i_subminapplydelay));
 
 		/* Decide whether we want to dump it */
 		selectDumpableObject(&(subinfo[i].dobj), fout);
@@ -4687,6 +4695,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
+	if (subinfo->subminapplydelay > 0)
+		appendPQExpBuffer(query, ", min_apply_delay = '%d ms'", subinfo->subminapplydelay);
+
 	appendPQExpBufferStr(query, ");\n");
 
 	if (subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION)
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index e7cbd8d7ed..b8831c3ed3 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -661,6 +661,7 @@ typedef struct _SubscriptionInfo
 	char	   *subdisableonerr;
 	char	   *suborigin;
 	char	   *subsynccommit;
+	int			subminapplydelay;
 	char	   *subpublications;
 } SubscriptionInfo;
 
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index c8a0bb7b3a..81d4607a1c 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6472,7 +6472,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false, false, false, false, false};
+	false, false, false, false, false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6527,10 +6527,13 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Two-phase commit"),
 							  gettext_noop("Disable on error"));
 
+		/* Origin and min_apply_delay are only supported in v16 and higher */
 		if (pset.sversion >= 160000)
 			appendPQExpBuffer(&buf,
-							  ", suborigin AS \"%s\"\n",
-							  gettext_noop("Origin"));
+							  ", suborigin AS \"%s\"\n"
+							  ", subminapplydelay AS \"%s\"\n",
+							  gettext_noop("Origin"),
+							  gettext_noop("Min apply delay"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5e1882eaea..e8b9a43a47 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1925,7 +1925,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("(", "PUBLICATION");
 	/* ALTER SUBSCRIPTION <name> SET ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SET", "("))
-		COMPLETE_WITH("binary", "disable_on_error", "origin", "slot_name",
+		COMPLETE_WITH("binary", "disable_on_error", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit");
 	/* ALTER SUBSCRIPTION <name> SKIP ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SKIP", "("))
@@ -3268,7 +3268,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
-					  "disable_on_error", "enabled", "origin", "slot_name",
+					  "disable_on_error", "enabled", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index b0f2a1705d..d1cfefc6d6 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -74,6 +74,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	Oid			subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */
 
+	int32		subminapplydelay;	/* Replication apply delay (ms) */
+
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
@@ -122,6 +124,7 @@ typedef struct Subscription
 								 * skipped */
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
+	int32		minapplydelay;	/* Replication apply delay (ms) */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index dc87a4edd1..3dc09d1a4c 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -255,7 +255,7 @@ extern void stream_stop_internal(TransactionId xid);
 
 /* Common streaming function to apply all the spooled messages */
 extern void apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-								   XLogRecPtr lsn);
+								   XLogRecPtr lsn, TimestampTz finish_ts);
 
 extern void apply_dispatch(StringInfo s);
 
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 3f99b14394..cf8e727ee9 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -114,18 +114,18 @@ CREATE SUBSCRIPTION regress_testsub4 CONNECTION 'dbname=regress_doesnotexist' PU
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub4 SET (origin = any);
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub3;
@@ -143,10 +143,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -163,10 +163,10 @@ ERROR:  unrecognized subscription parameter: "create_slot"
 -- ok
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/12345');
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/12345
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/12345
 (1 row)
 
 -- ok - with lsn = NONE
@@ -175,10 +175,10 @@ ALTER SUBSCRIPTION regress_testsub SKIP (lsn = NONE);
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/0');
 ERROR:  invalid WAL location (LSN): 0/0
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/0
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 BEGIN;
@@ -210,10 +210,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                                               List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
----------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | local              | dbname=regress_doesnotexist2 | 0/0
+                                                                                                        List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | local              | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 -- rename back to keep the rest simple
@@ -247,19 +247,19 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -271,27 +271,27 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication already exists
@@ -306,10 +306,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                                                 List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication used more than once
@@ -324,10 +324,10 @@ ERROR:  publication "testpub3" is not in subscription "regress_testsub"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -363,10 +363,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 --fail - alter of two_phase option not supported.
@@ -375,10 +375,10 @@ ERROR:  unrecognized subscription parameter: "two_phase"
 -- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -388,10 +388,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -404,20 +404,57 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+ERROR:  invalid value for parameter "min_apply_delay": "foo"
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+ERROR:  -1 ms is outside the valid range for parameter "min_apply_delay" (0 .. 2147483647)
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+ERROR:  min_apply_delay > 0 and streaming = parallel are mutually exclusive options
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+WARNING:  subscription was created, but is not connected
+HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |             123 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |        86400000 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+ERROR:  cannot set parallel streaming mode for subscription with min_apply_delay
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ERROR:  cannot set min_apply_delay for subscription in parallel streaming mode
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 7281f5fee2..7317b140f5 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -286,6 +286,30 @@ ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+\dRs+
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 91aa068c95..f94819672b 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -515,6 +515,36 @@ $node_publisher->poll_query_until('postgres',
   or die
   "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
 
+# Test time-delayed logical replication
+#
+# If the subscription sets min_apply_delay parameter, the logical replication
+# worker will delay the transaction apply for min_apply_delay milliseconds. We
+# look the time duration between tuples are inserted on publisher and then
+# changes are replicated on subscriber.
+my $delay = 3;
+
+# Set min_apply_delay parameter to 3 seconds
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+
+# Make new content on publisher and check its presence in subscriber depending
+# on the delay applied above. Before doing the insertion, get the
+# current timestamp that will be used as a comparison base. Even on slow
+# machines, this allows to have a predictable behavior when comparing the
+# delay between data insertion moment on publisher and replay time on subscriber.
+my $publisher_insert_time = time();
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_ins VALUES (generate_series(1101, 1120))");
+
+# The publisher waits for the replication to complete
+$node_publisher->wait_for_catchup('tap_sub_renamed');
+
+# This test is successful if and only if the LSN has been applied with at least
+# the configured apply delay.
+ok( time() - $publisher_insert_time >= $delay,
+	"subscriber applies WAL only after replication delay for non-streaming transaction"
+);
+
 # check all the cleanup
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
 
-- 
2.27.0

0002-Extend-START_REPLICATION-command-to-accept-walsender.patch (application/octet-stream)
From 5406a52c1f827f5cbf44399f52f9c0192ffc52d3 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Tue, 7 Feb 2023 05:38:20 +0000
Subject: [PATCH 2/2] Extend START_REPLICATION command to accept walsender
 options

This commit extends START_REPLICATION to accept walsender options. Currently,
only one option, exit_before_confirming, is accepted.

For physical replication, the grammar of START_REPLICATION is extended to accept
options. Note that in normal physical replication the added option is never
used.

For logical replication, the option list for the logical decoding plugin is
reused for storing walsender options. When the min_apply_delay parameter is set
for a subscription, the associated apply worker sends a START_REPLICATION
command with exit_before_confirming = true to the publisher node.

This option allows primary servers to shut down even if there are pending WALs to
be sent or sent WALs are not flushed on the secondary. This may be useful to
shut down the primary even when the walreceiver/worker is stuck.

Author: Hayato Kuroda
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 doc/src/sgml/protocol.sgml                    | 21 ++++-
 .../libpqwalreceiver/libpqwalreceiver.c       |  4 +
 src/backend/replication/logical/worker.c      | 13 ++-
 src/backend/replication/repl_gram.y           |  8 +-
 src/backend/replication/walsender.c           | 87 ++++++++++++++++++-
 src/include/replication/walreceiver.h         |  1 +
 src/test/subscription/t/001_rep_changes.pl    | 10 ++-
 src/tools/pgindent/typedefs.list              |  1 +
 8 files changed, 138 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 93fc7167d4..9c84d57cfb 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2192,7 +2192,7 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
     </varlistentry>
 
     <varlistentry id="protocol-replication-start-replication">
-     <term><literal>START_REPLICATION</literal> [ <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> ] [ <literal>PHYSICAL</literal> ] <replaceable class="parameter">XXX/XXX</replaceable> [ <literal>TIMELINE</literal> <replaceable class="parameter">tli</replaceable> ]
+     <term><literal>START_REPLICATION</literal> [ <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> ] [ <literal>PHYSICAL</literal> ] <replaceable class="parameter">XXX/XXX</replaceable> [ <literal>TIMELINE</literal> <replaceable class="parameter">tli</replaceable> ] [ ( <replaceable>option_name</replaceable> [ <replaceable>option_value</replaceable> ] [, ...] ) ]
       <indexterm><primary>START_REPLICATION</primary></indexterm>
      </term>
      <listitem>
@@ -2496,6 +2496,25 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
         </listitem>
        </varlistentry>
       </variablelist>
+
+      <para>
+       Further options give finer control over the behavior of the
+       walsender. Currently the following option is accepted:
+      </para>
+
+      <variablelist>
+       <varlistentry>
+        <term>exit_before_confirming</term>
+        <listitem>
+         <para>
+          If set to true, the walsender will exit before confirming the remote
+          flush of WALs at shutdown. This can be useful when the network lag
+          between nodes is large and shutting down the server would take long.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+
      </listitem>
     </varlistentry>
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 560ec974fa..8bf8e03063 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -443,6 +443,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", binary 'true'");
 
+		if (options->proto.logical.exit_before_confirming &&
+			PQserverVersion(conn->streamConn) >= 160000)
+			appendStringInfoString(&cmd, ", exit_before_confirming 'true'");
+
 		appendStringInfoChar(&cmd, ')');
 	}
 	else
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c574531040..d768bafd3e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4034,7 +4034,9 @@ maybe_reread_subscription(void)
 		newsub->stream != MySubscription->stream ||
 		strcmp(newsub->origin, MySubscription->origin) != 0 ||
 		newsub->owner != MySubscription->owner ||
-		!equal(newsub->publications, MySubscription->publications))
+		!equal(newsub->publications, MySubscription->publications) ||
+		(newsub->minapplydelay > 0 && MySubscription->minapplydelay == 0) ||
+		(newsub->minapplydelay == 0 && MySubscription->minapplydelay > 0))
 	{
 		if (am_parallel_apply_worker())
 			ereport(LOG,
@@ -4756,6 +4758,15 @@ ApplyWorkerMain(Datum main_arg)
 
 	if (!am_tablesync_worker())
 	{
+		/*
+		 * time-delayed logical replication does not support tablesync
+		 * workers, so only the leader apply worker can request walsenders to
+		 * exit before confirming remote flush.
+		 */
+		if (server_version >= 160000)
+			options.proto.logical.exit_before_confirming =
+				MySubscription->minapplydelay > 0;
+
 		/*
 		 * Even when the two_phase mode is requested by the user, it remains
 		 * as the tri-state PENDING until all tablesyncs have reached READY
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..1705d52a58 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -91,6 +91,7 @@ Node *replication_parse_result;
 %type <boolval>	opt_temporary
 %type <list>	create_slot_options create_slot_legacy_opt_list
 %type <defelt>	create_slot_legacy_opt
+%type <list>	walsender_options
 
 %%
 
@@ -261,7 +262,7 @@ drop_replication_slot:
  * START_REPLICATION [SLOT slot] [PHYSICAL] %X/%X [TIMELINE %d]
  */
 start_replication:
-			K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline
+			K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline walsender_options
 				{
 					StartReplicationCmd *cmd;
 
@@ -270,6 +271,7 @@ start_replication:
 					cmd->slotname = $2;
 					cmd->startpoint = $4;
 					cmd->timeline = $5;
+					cmd->options = $6;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -336,6 +338,10 @@ opt_timeline:
 				| /* EMPTY */			{ $$ = 0; }
 			;
 
+walsender_options:
+			'(' generic_option_list ')'			{ $$ = $2; }
+			| /* EMPTY */					{ $$ = NIL; }
+		;
 
 plugin_options:
 			'(' plugin_opt_list ')'			{ $$ = $2; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4ed3747e3f..1bbcb8adf1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -219,6 +219,22 @@ typedef struct
 
 static LagTracker *lag_tracker;
 
+/*
+ * If set to true, the walsender will exit at shutdown without confirming the
+ * remote flush of WAL and without checking that the send buffer is empty.
+ */
+static bool exit_before_confirming = false;
+
+/*
+ * Options for controlling the behavior of the walsender. Options can be
+ * specified in the START_REPLICATION replication command. Currently only one
+ * option is allowed.
+ */
+typedef struct
+{
+	bool		exit_before_confirming;
+} WalSndData;
+
 /* Signal handlers */
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
@@ -260,6 +276,7 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
 							  TimeLineID *tli_p);
 
+static void ConsumeWalsenderOptions(List *options, WalSndData *data);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -672,6 +689,7 @@ StartReplication(StartReplicationCmd *cmd)
 	StringInfoData buf;
 	XLogRecPtr	FlushPtr;
 	TimeLineID	FlushTLI;
+	WalSndData	data;
 
 	/* create xlogreader for physical replication */
 	xlogreader =
@@ -710,6 +728,12 @@ StartReplication(StartReplicationCmd *cmd)
 		 */
 	}
 
+	/* Check given options and set flags accordingly */
+	ConsumeWalsenderOptions(cmd->options, &data);
+
+	if (data.exit_before_confirming)
+		exit_before_confirming = true;
+
 	/*
 	 * Select the timeline. If it was given explicitly by the client, use
 	 * that. Otherwise use the timeline of the last replayed record.
@@ -1245,6 +1269,7 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 {
 	StringInfoData buf;
 	QueryCompletion qc;
+	WalSndData	data;
 
 	/* make sure that our requirements are still fulfilled */
 	CheckLogicalDecodingRequirements();
@@ -1272,6 +1297,12 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 		got_STOPPING = true;
 	}
 
+	/* Check given options and set flags accordingly */
+	ConsumeWalsenderOptions(cmd->options, &data);
+
+	if (data.exit_before_confirming)
+		exit_before_confirming = true;
+
 	/*
 	 * Create our decoding context, making it start at the previously ack'ed
 	 * position.
@@ -1450,6 +1481,9 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		if (exit_before_confirming)
+			WalSndDone(XLogSendLogical);
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
@@ -3118,15 +3152,16 @@ WalSndDone(WalSndSendDataCallback send_data)
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	if (WalSndCaughtUp &&
+		(exit_before_confirming ||
+		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
 	{
 		QueryCompletion qc;
 
 		/* Inform the standby that XLOG streaming is done */
 		SetQueryCompletion(&qc, CMDTAG_COPY, 0);
 		EndCommand(&qc, DestRemote, false);
-		pq_flush();
+		pq_flush_if_writable();
 
 		proc_exit(0);
 	}
@@ -3849,3 +3884,49 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+/*
+ * Read every entry of the list and consume the walsender options.
+ *
+ * In logical replication mode, the given list may contain both walsender and
+ * output plugin options, and a leftover walsender option would lead to an
+ * "unrecognized pgoutput option" ERROR. Therefore, entries for walsender
+ * options are removed from the list when found.
+ */
+static void
+ConsumeWalsenderOptions(List *options, WalSndData *data)
+{
+	ListCell   *lc;
+	bool		exit_before_confirming_given = false;
+
+	foreach(lc, options)
+	{
+		DefElem    *defel = (DefElem *) lfirst(lc);
+
+		Assert(defel->arg == NULL || IsA(defel->arg, String));
+
+		/* Check each param, whether or not we recognize it */
+		if (strcmp(defel->defname, "exit_before_confirming") == 0)
+		{
+			if (exit_before_confirming_given)
+				ereport(ERROR,
+						errcode(ERRCODE_SYNTAX_ERROR),
+						errmsg("conflicting or redundant options"));
+			exit_before_confirming_given = true;
+
+			data->exit_before_confirming = defGetBoolean(defel);
+
+			/*
+			 * Eliminate the current element, because the list may be passed on
+			 * to the pgoutput module, which would raise an ERROR due to the
+			 * unrecognized option.
+			 */
+			options = foreach_delete_current(options, lc);
+		}
+
+		/*
+		 * ERROR is not raised here even if the given parameter is not known,
+		 * because it may be written for the output plugin.
+		 */
+	}
+}
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index decffe352d..f801fb3e0d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -187,6 +187,7 @@ typedef struct
 									 * prepare time */
 			char	   *origin; /* Only publish data originating from the
 								 * specified origin */
+			bool		exit_before_confirming;
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index f94819672b..d7a6fd0e38 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -523,9 +523,17 @@ $node_publisher->poll_query_until('postgres',
 # changes are replicated on subscriber.
 my $delay = 3;
 
-# Set min_apply_delay parameter to 3 seconds
+# check restart on changing min_apply_delay to 3 seconds
+$oldpid = $node_publisher->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+);
 $node_subscriber->safe_psql('postgres',
 	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+$node_publisher->poll_query_until('postgres',
+	"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+  )
+  or die
+  "Timed out while waiting for apply to restart after changing min_apply_delay to non-zero value";
 
 # Make new content on publisher and check its presence in subscriber depending
 # on the delay applied above. Before doing the insertion, get the
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 07fbb7ccf6..3b7f8eb063 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2968,6 +2968,7 @@ WalReceiverConn
 WalReceiverFunctionsType
 WalSnd
 WalSndCtlData
+WalSndData
 WalSndSendDataCallback
 WalSndState
 WalTimeSample
-- 
2.27.0

#41Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#40)
2 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

Dear Andres, Amit,

On 2023-02-07 09:00:13 +0530, Amit Kapila wrote:

On Tue, Feb 7, 2023 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:

How about we make it an option in START_REPLICATION? Delayed logical rep can
toggle that on by default.

Works for me. So, when this option is set in START_REPLICATION
message, walsender will set some flag and allow itself to exit at
shutdown without waiting for WAL to be sent?

Yes. I think that might be useful in other situations as well, but we don't
need to make those configurable initially. But I imagine it'd be useful to set
things up so that non-HA physical replicas don't delay shutdown, particularly
if they're geographically far away.

Based on the discussion, I made a patch that adds a walsender option,
exit_before_confirming, to the START_REPLICATION replication command. It can be
used for both physical and logical replication. I wrote the patch with
extensibility in mind, so further options can be added later.
Suggestions for a better name are very welcome.

For physical replication, the grammar was changed slightly to resemble the
logical one. It can now accept options, but currently only one option is
allowed, and it is not used in normal streaming replication. For logical
replication, the option is combined with the options for the output plugin. Of
course, we can change the API to a better style.
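
To illustrate how a walsender option can share one list with output plugin
options, here is a minimal standalone sketch in C. It is not the code from the
patch: the Option and WalSndOpts types and consume_walsender_options() are
hypothetical stand-ins (the patch works on PostgreSQL's List and DefElem), and
the boolean parsing is deliberately simplified. It only shows the idea of
setting a defined default, consuming the recognized option, and leaving the
rest for the plugin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical (name, value) pair; a stand-in, not PostgreSQL's DefElem. */
typedef struct
{
	const char *name;
	const char *value;			/* NULL means no value was given */
} Option;

/* Hypothetical result struct holding the single walsender option. */
typedef struct
{
	bool		exit_before_confirming;
} WalSndOpts;

/*
 * Pull walsender options out of opts[0..n-1] and compact the remaining
 * (plugin) options to the front of the array.  Returns the number of
 * remaining options, or -1 if the walsender option is given twice.
 */
int
consume_walsender_options(Option *opts, int n, WalSndOpts *out)
{
	int			kept = 0;
	bool		seen = false;

	assert(opts != NULL || n == 0);
	assert(out != NULL);

	/* Start from a defined default so the caller never reads garbage. */
	out->exit_before_confirming = false;

	for (int i = 0; i < n; i++)
	{
		if (strcmp(opts[i].name, "exit_before_confirming") == 0)
		{
			if (seen)
				return -1;		/* conflicting or redundant options */
			seen = true;

			/* Simplification: a bare option or the literal "true" enables it. */
			out->exit_before_confirming =
				(opts[i].value == NULL || strcmp(opts[i].value, "true") == 0);
			continue;			/* consumed, so not passed to the plugin */
		}

		/* Not recognized here: keep it, it may belong to the output plugin. */
		opts[kept++] = opts[i];
	}

	return kept;
}
```

The caller would hand only the first `kept` entries to the output plugin, so
the plugin never sees an option it does not recognize.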

The 0001 patch was ported from the time-delayed logical replication thread[1],
and it uses the added option. When the min_apply_delay option is specified and
the publisher is PG16 or later, the apply worker sends a START_REPLICATION
query with exit_before_confirming = true. The worker will also restart and send
START_REPLICATION again when min_apply_delay is changed from zero to a non-zero
value, or from non-zero to zero.
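
The two conditions described above can be sketched as small standalone C
predicates. This is an illustration only: the function names and the macro are
hypothetical, and the 160000 threshold is assumed from the server_version check
in the patch rather than from any released behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed threshold, mirroring the patch's "server_version >= 160000" check. */
#define EXIT_BEFORE_CONFIRMING_MIN_VERSION 160000

/*
 * Should the apply worker add exit_before_confirming to START_REPLICATION?
 * Only when the publisher is new enough and a delay is configured.
 */
bool
want_exit_before_confirming(int publisher_version, int32_t min_apply_delay_ms)
{
	/* min_apply_delay is required to be non-negative. */
	assert(min_apply_delay_ms >= 0);

	return publisher_version >= EXIT_BEFORE_CONFIRMING_MIN_VERSION &&
		min_apply_delay_ms > 0;
}

/*
 * Must the worker restart so that START_REPLICATION is sent again?
 * Only when the delay switches between zero and non-zero; a change from one
 * non-zero value to another does not alter the requested option.
 */
bool
delay_change_needs_restart(int32_t old_delay_ms, int32_t new_delay_ms)
{
	assert(old_delay_ms >= 0 && new_delay_ms >= 0);

	return (old_delay_ms > 0) != (new_delay_ms > 0);
}
```

Comparing only the zero/non-zero state keeps restarts to the cases where the
option sent to the publisher would actually differ.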

Note that I removed the version number because the approach has changed completely.

[1]: /messages/by-id/TYCPR01MB8373BA483A6D2C924C600968EDDB9@TYCPR01MB8373.jpnprd01.prod.outlook.com

I noticed that the previous patches were rejected by cfbot, even though they
passed in my environment. PSA the fixed version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v2-0001-Time-delayed-logical-replication-subscriber.patch (application/octet-stream)
From f20c835021e2d2fe157312523f863fc8f6e4b0e3 Mon Sep 17 00:00:00 2001
From: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Date: Tue, 7 Feb 2023 13:05:34 +0000
Subject: [PATCH v2 1/2] Time-delayed logical replication subscriber

Similar to physical replication, a time-delayed copy of the data for
logical replication is useful for some scenarios (particularly to fix
errors that might cause data loss).

This patch implements a new subscription parameter called 'min_apply_delay'.

If the subscription sets min_apply_delay parameter, the logical
replication worker will delay the transaction apply for min_apply_delay
milliseconds.

The delay is calculated between the WAL time stamp and the current time
on the subscriber.

The delay occurs before we start to apply the transaction on the
subscriber. The main reason is to avoid keeping a transaction open for
a long time. Regular and prepared transactions are covered. Streamed
transactions are also covered.

The combination of parallel streaming mode and min_apply_delay is not
allowed. This is because in parallel streaming mode, we start applying
the transaction stream as soon as the first change arrives without
knowing the transaction's prepare/commit time. This means we cannot
calculate the underlying network/decoding lag between publisher and
subscriber, and so always waiting for the full 'min_apply_delay' period
might include unnecessary delay.

The other possibility was to apply the delay at the end of the parallel
apply transaction but that would cause issues related to resource
bloat and locks being held for a long time.

Note that this feature doesn't interact with skip transaction feature.
The skip transaction feature applies to one transaction with a specific LSN.
So, even if the skipped transaction and non-skipped transaction come
consecutively in a very short time, regardless of the order of which comes
first, the time-delayed feature gets balanced by delayed application
for other transactions before and after the skipped transaction.

Author: Euler Taveira, Takamichi Osumi, Kuroda Hayato
Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Shveta Malik,
             Kyotaro Horiguchi, Shi Yu, Wang Wei, Dilip Kumar, Melih Mutlu
Discussion: https://postgr.es/m/CAB-JLwYOYwL=XTyAXKiH5CtM_Vm8KjKh7aaitCKvmCh4rzr5pQ@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                    |   9 +
 doc/src/sgml/config.sgml                      |  12 ++
 doc/src/sgml/glossary.sgml                    |  14 ++
 doc/src/sgml/logical-replication.sgml         |   6 +
 doc/src/sgml/ref/alter_subscription.sgml      |   5 +-
 doc/src/sgml/ref/create_subscription.sgml     |  49 ++++-
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/catalog/system_views.sql          |   7 +-
 src/backend/commands/subscriptioncmds.c       | 122 +++++++++++-
 .../replication/logical/applyparallelworker.c |   3 +-
 src/backend/replication/logical/worker.c      | 165 ++++++++++++++--
 src/bin/pg_dump/pg_dump.c                     |  15 +-
 src/bin/pg_dump/pg_dump.h                     |   1 +
 src/bin/psql/describe.c                       |   9 +-
 src/bin/psql/tab-complete.c                   |   4 +-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/replication/worker_internal.h     |   2 +-
 src/test/regress/expected/subscription.out    | 181 +++++++++++-------
 src/test/regress/sql/subscription.sql         |  24 +++
 src/test/subscription/t/001_rep_changes.pl    |  30 +++
 20 files changed, 558 insertions(+), 104 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048054..5dc5ca1133 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7873,6 +7873,15 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subminapplydelay</structfield> <type>int4</type>
+      </para>
+      <para>
+       The minimum delay, in milliseconds, for applying changes
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subname</structfield> <type>name</type>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d190be1925..626a8b5bd0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4787,6 +4787,18 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
        the <filename>postgresql.conf</filename> file or on the server
        command line.
       </para>
+      <para>
+       For time-delayed logical replication, the apply worker sends a feedback
+       message to the publisher every
+       <varname>wal_receiver_status_interval</varname> milliseconds. Make sure
+       to set <varname>wal_receiver_status_interval</varname> less than the
+       <varname>wal_sender_timeout</varname> on the publisher, otherwise, the
+       <literal>walsender</literal> will repeatedly terminate due to timeout
+       errors. Note that if <varname>wal_receiver_status_interval</varname> is
+       set to zero, the apply worker sends no feedback messages during the
+       <literal>min_apply_delay</literal> period. Refer to
+       <xref linkend="sql-createsubscription"/> for more information.
+      </para>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 7c01a541fe..6ed6fa5853 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1729,6 +1729,20 @@
    </glossdef>
   </glossentry>
 
+  <glossentry id="glossary-time-delayed-replication">
+   <glossterm>Time-delayed replication</glossterm>
+   <glossdef>
+     <para>
+      Replication setup that applies time-delayed copy of the data.
+    </para>
+    <para>
+     For more information, see
+     <xref linkend="guc-recovery-min-apply-delay"/> for physical replication
+     and <xref linkend="sql-createsubscription"/> for logical replication.
+    </para>
+   </glossdef>
+  </glossentry>
+
   <glossentry id="glossary-toast">
    <glossterm>TOAST</glossterm>
    <glossdef>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 1bd5660c87..6bd5f61e2b 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -247,6 +247,12 @@
    target table.
   </para>
 
+  <para>
+   A subscription can delay the application of changes by specifying the
+   <literal>min_apply_delay</literal> subscription parameter. See
+   <xref linkend="sql-createsubscription"/> for details.
+  </para>
+
   <sect2 id="logical-replication-subscription-slot">
    <title>Replication Slot Management</title>
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 964fcbb8ff..8b7eb28e54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -213,8 +213,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
       <literal>binary</literal>, <literal>streaming</literal>,
-      <literal>disable_on_error</literal>, and
-      <literal>origin</literal>.
+      <literal>disable_on_error</literal>,
+      <literal>origin</literal>, and
+      <literal>min_apply_delay</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 51c45f17c7..1b4b8390af 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -349,7 +349,49 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
-      </variablelist></para>
+
+       <varlistentry>
+        <term><literal>min_apply_delay</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          By default, the subscriber applies changes as soon as possible. This
+          parameter allows the user to delay the application of changes by a
+          given time period. If the value is specified without units, it is
+          taken as milliseconds. The default is zero (no delay). See
+          <xref linkend="config-setting-names-values"/> for details on the
+          available valid time units.
+         </para>
+         <para>
+          Any delay becomes effective only after all initial table
+          synchronization has finished and occurs before each transaction starts
+          to get applied on the subscriber. The delay is calculated as the
+          difference between the WAL timestamp as written on the publisher and
+          the current time on the subscriber. Any overhead of time spent in
+          logical decoding and in transferring the transaction may reduce the
+          actual wait time. It is also possible that the overhead already
+          exceeds the requested <literal>min_apply_delay</literal> value, in
+          which case no delay is applied. If the system clocks on publisher and
+          subscriber are not synchronized, this may lead to apply changes
+          earlier than expected, but this is not a major issue because this
+          parameter is typically much larger than the time deviations between
+          servers. Note that if this parameter is set to a long delay, the
+          replication will stop if the replication slot falls behind the current
+          LSN by more than
+          <link linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</literal></link>.
+         </para>
+         <warning>
+           <para>
+            Delaying the replication means there is a much longer time between
+            making a change on the publisher, and that change being committed
+            on the subscriber. This can impact the performance of synchronous
+            replication. See <xref linkend="guc-synchronous-commit"/>
+            parameter.
+           </para>
+         </warning>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
 
     </listitem>
    </varlistentry>
@@ -420,6 +462,11 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
    published with different column lists are not supported.
   </para>
 
+  <para>
+   A non-zero <literal>min_apply_delay</literal> parameter is not allowed when
+   streaming in parallel mode.
+  </para>
+
   <para>
    We allow non-existent publications to be specified so that users can add
    those later. This means
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index a56ae311c3..e19e5cbca2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->skiplsn = subform->subskiplsn;
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
+	sub->minapplydelay = subform->subminapplydelay;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..317c2010cb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1299,9 +1299,10 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled,
-              subbinary, substream, subtwophasestate, subdisableonerr,
-              subslotname, subsynccommit, subpublications, suborigin)
+GRANT SELECT (oid, subdbid, subskiplsn, subminapplydelay, subname, subowner,
+              subenabled, subbinary, substream, subtwophasestate,
+              subdisableonerr, subslotname, subsynccommit, subpublications,
+              suborigin)
     ON pg_subscription TO public;
 
 CREATE VIEW pg_stat_subscription_stats AS
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 464db6d247..82e16fd0f9 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -66,6 +66,7 @@
 #define SUBOPT_DISABLE_ON_ERR		0x00000400
 #define SUBOPT_LSN					0x00000800
 #define SUBOPT_ORIGIN				0x00001000
+#define SUBOPT_MIN_APPLY_DELAY		0x00002000
 
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
@@ -90,6 +91,7 @@ typedef struct SubOpts
 	bool		disableonerr;
 	char	   *origin;
 	XLogRecPtr	lsn;
+	int32		min_apply_delay;
 } SubOpts;
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
@@ -100,7 +102,7 @@ static void check_publications_origin(WalReceiverConn *wrconn,
 static void check_duplicates_in_publist(List *publist, Datum *datums);
 static List *merge_publications(List *oldpublist, List *newpublist, bool addpub, const char *subname);
 static void ReportSlotConnectionError(List *rstates, Oid subid, char *slotname, char *err);
-
+static int32 defGetMinApplyDelay(DefElem *def);
 
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
@@ -146,6 +148,8 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 		opts->disableonerr = false;
 	if (IsSet(supported_opts, SUBOPT_ORIGIN))
 		opts->origin = pstrdup(LOGICALREP_ORIGIN_ANY);
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY))
+		opts->min_apply_delay = 0;
 
 	/* Parse options */
 	foreach(lc, stmt_options)
@@ -324,6 +328,15 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 			opts->specified_opts |= SUBOPT_LSN;
 			opts->lsn = lsn;
 		}
+		else if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+				 strcmp(defel->defname, "min_apply_delay") == 0)
+		{
+			if (IsSet(opts->specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				errorConflictingDefElem(defel, pstate);
+
+			opts->specified_opts |= SUBOPT_MIN_APPLY_DELAY;
+			opts->min_apply_delay = defGetMinApplyDelay(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -404,6 +417,32 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 								"slot_name = NONE", "create_slot = false")));
 		}
 	}
+
+	/*
+	 * The combination of parallel streaming mode and min_apply_delay is not
+	 * allowed. This is because in parallel streaming mode, we start applying
+	 * the transaction stream as soon as the first change arrives without
+	 * knowing the transaction's prepare/commit time. This means we cannot
+	 * calculate the underlying network/decoding lag between publisher and
+	 * subscriber, and so always waiting for the full 'min_apply_delay' period
+	 * might include unnecessary delay.
+	 *
+	 * The other possibility was to apply the delay at the end of the parallel
+	 * apply transaction but that would cause issues related to resource bloat
+	 * and locks being held for a long time.
+	 */
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+		opts->min_apply_delay > 0 &&
+		opts->streaming == LOGICALREP_STREAM_PARALLEL)
+		ereport(ERROR,
+				errcode(ERRCODE_SYNTAX_ERROR),
+
+		/*
+		 * translator: the first %s is a string of the form "parameter > 0"
+		 * and the second one is "option = value".
+		 */
+				errmsg("%s and %s are mutually exclusive options",
+					   "min_apply_delay > 0", "streaming = parallel"));
 }
 
 /*
@@ -560,7 +599,8 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 					  SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 					  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 					  SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT |
-					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN);
+					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN |
+					  SUBOPT_MIN_APPLY_DELAY);
 	parse_subscription_options(pstate, stmt->options, supported_opts, &opts);
 
 	/*
@@ -625,6 +665,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 	values[Anum_pg_subscription_oid - 1] = ObjectIdGetDatum(subid);
 	values[Anum_pg_subscription_subdbid - 1] = ObjectIdGetDatum(MyDatabaseId);
 	values[Anum_pg_subscription_subskiplsn - 1] = LSNGetDatum(InvalidXLogRecPtr);
+	values[Anum_pg_subscription_subminapplydelay - 1] = Int32GetDatum(opts.min_apply_delay);
 	values[Anum_pg_subscription_subname - 1] =
 		DirectFunctionCall1(namein, CStringGetDatum(stmt->subname));
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
@@ -1054,7 +1095,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 				supported_opts = (SUBOPT_SLOT_NAME |
 								  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 								  SUBOPT_STREAMING | SUBOPT_DISABLE_ON_ERR |
-								  SUBOPT_ORIGIN);
+								  SUBOPT_ORIGIN | SUBOPT_MIN_APPLY_DELAY);
 
 				parse_subscription_options(pstate, stmt->options,
 										   supported_opts, &opts);
@@ -1098,6 +1139,19 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.streaming == LOGICALREP_STREAM_PARALLEL &&
+						!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)
+						&& sub->minapplydelay > 0)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set parallel streaming mode for subscription with %s",
+									   "min_apply_delay"));
+
 					values[Anum_pg_subscription_substream - 1] =
 						CharGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -1111,6 +1165,26 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 						= true;
 				}
 
+				if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.min_apply_delay > 0 &&
+						!IsSet(opts.specified_opts, SUBOPT_STREAMING)
+						&& sub->stream == LOGICALREP_STREAM_PARALLEL)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set %s for subscription in parallel streaming mode",
+									   "min_apply_delay"));
+
+					values[Anum_pg_subscription_subminapplydelay - 1] =
+						Int32GetDatum(opts.min_apply_delay);
+					replaces[Anum_pg_subscription_subminapplydelay - 1] = true;
+				}
+
 				if (IsSet(opts.specified_opts, SUBOPT_ORIGIN))
 				{
 					values[Anum_pg_subscription_suborigin - 1] =
@@ -2195,3 +2269,45 @@ defGetStreamingMode(DefElem *def)
 					def->defname)));
 	return LOGICALREP_STREAM_OFF;	/* keep compiler quiet */
 }
+
+/*
+ * Extract the min_apply_delay value from a DefElem. This is very similar to
+ * parse_and_validate_value() for integer values, because min_apply_delay
+ * accepts the same parameter format as recovery_min_apply_delay.
+ */
+static int32
+defGetMinApplyDelay(DefElem *def)
+{
+	char	   *input_string;
+	int			result;
+	const char *hintmsg;
+
+	input_string = defGetString(def);
+
+	/*
+	 * Parse the given string as a parameter with millisecond units.
+	 */
+	if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid value for parameter \"%s\": \"%s\"",
+						"min_apply_delay", input_string),
+				 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+
+	/*
+	 * Check the lower bound of the valid min_apply_delay range, and also the
+	 * upper bound as a safeguard for platforms where int is wider than
+	 * int32. Although parse_int() has confirmed that the result is less than
+	 * or equal to INT_MAX, the value will be stored in an int32 catalog
+	 * column.
+	 */
+	if (result < 0 || result > PG_INT32_MAX)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)",
+						result,
+						"min_apply_delay",
+						0, PG_INT32_MAX)));
+
+	return result;
+}
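
Illustrative sketch, not part of the patch. The ALTER SUBSCRIPTION checks above, together with the check in parse_subscription_options(), appear to amount to one rule: once the command has been applied, the subscription must not combine streaming = parallel with min_apply_delay > 0. Below is a minimal standalone model of that rule, assuming the subscription was in a valid state before the command. The SubState and AlterRequest types and alter_is_allowed() are hypothetical names used only for illustration; they do not exist in PostgreSQL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the relevant pg_subscription fields. */
typedef struct SubState
{
	bool		streaming_parallel; /* substream is parallel */
	int32_t		min_apply_delay_ms; /* subminapplydelay */
} SubState;

/* Hypothetical view of which options an ALTER ... SET (...) names. */
typedef struct AlterRequest
{
	bool		sets_streaming;
	bool		new_streaming_parallel;
	bool		sets_delay;
	int32_t		new_delay_ms;
} AlterRequest;

/*
 * Compute the effective values after the command: an option that is not
 * specified keeps its current catalog value. The command is allowed only
 * if the result does not combine parallel streaming with a positive delay.
 */
bool
alter_is_allowed(const SubState *cur, const AlterRequest *req)
{
	bool		parallel = req->sets_streaming ?
		req->new_streaming_parallel : cur->streaming_parallel;
	int32_t		delay = req->sets_delay ?
		req->new_delay_ms : cur->min_apply_delay_ms;

	return !(parallel && delay > 0);
}
```

In the patch itself the same outcome is reached in three places: both options given together are rejected while parsing, and each option given alone is checked against the stored value of the other.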
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index da437e0bc3..32db20fd98 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -704,7 +704,8 @@ pa_process_spooled_messages_if_required(void)
 	{
 		apply_spooled_messages(&MyParallelShared->fileset,
 							   MyParallelShared->xid,
-							   InvalidXLogRecPtr);
+							   InvalidXLogRecPtr,
+							   0);
 		pa_set_fileset_state(MyParallelShared, FS_EMPTY);
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfb2ab6248..c574531040 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -319,6 +319,17 @@ static List *on_commit_wakeup_workers_subids = NIL;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/*
+ * To avoid a walsender timeout during time-delayed logical replication, the
+ * apply worker keeps sending feedback messages during the delay period.
+ * Because the delay is applied before the transaction starts, no WAL
+ * records are written for the suspended changes during the wait. When the
+ * apply worker sends a feedback message during the delay, it must not
+ * overwrite the flushed and applied LSN positions with the last received
+ * LSN. See send_feedback() for details.
+ */
+static XLogRecPtr last_received = InvalidXLogRecPtr;
+
 /* fields valid only when processing streamed transaction */
 static bool in_streamed_transaction = false;
 
@@ -389,7 +400,8 @@ static void stream_write_change(char action, StringInfo s);
 static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
 static void stream_close_file(void);
 
-static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
+static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply,
+						  bool has_unprocessed_change);
 
 static void DisableSubscriptionAndExit(void);
 
@@ -999,6 +1011,109 @@ slot_modify_data(TupleTableSlot *slot, TupleTableSlot *srcslot,
 	ExecStoreVirtualTuple(slot);
 }
 
+/*
+ * When the min_apply_delay parameter is set on the subscriber, we wait long
+ * enough to make sure a transaction is applied at least that period behind
+ * the publisher.
+ *
+ * While physical replication applies the delay at commit time, this
+ * feature applies the delay before the transaction is started on the
+ * subscriber. This is mainly because keeping a transaction that has
+ * performed write operations open for a long time causes problems such as
+ * bloat and long-held locks.
+ *
+ * The min_apply_delay parameter will take effect only after all tables are in
+ * READY state.
+ *
+ * xid is the transaction id where we apply the delay.
+ *
+ * finish_ts is the commit/prepare time of both regular (non-streamed) and
+ * streamed transactions. Unlike the regular (non-streamed) cases, the delay
+ * is applied in a STREAM COMMIT/STREAM PREPARE message for streamed
+ * transactions. The STREAM START message does not contain a commit/prepare
+ * time (it will be available when the in-progress transaction finishes).
+ * Hence, it's not appropriate to apply a delay at the STREAM START time.
+ */
+static void
+maybe_apply_delay(TransactionId xid, TimestampTz finish_ts)
+{
+	Assert(finish_ts > 0);
+
+	/* Nothing to do if no delay set */
+	if (!MySubscription->minapplydelay)
+		return;
+
+	/*
+	 * The min_apply_delay parameter is ignored until all tablesync workers
+	 * have reached READY state. This is because if we allowed the delay
+	 * during the catchup phase, then once we reached the limit of tablesync
+	 * workers it would impose a delay for each subsequent worker. That would
+	 * cause initial table synchronization completion to take a long time.
+	 */
+	if (!AllTablesyncsReady())
+		return;
+
+	/* Apply the delay using the latch mechanism */
+	while (true)
+	{
+		TimestampTz delayUntil;
+		long		diffms;
+
+		ResetLatch(MyLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* This might change wal_receiver_status_interval */
+		if (ConfigReloadPending)
+		{
+			ConfigReloadPending = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		/*
+		 * Before calculating the time duration, reload the catalog if needed.
+		 */
+		if (!in_remote_transaction && !in_streamed_transaction)
+		{
+			AcceptInvalidationMessages();
+			maybe_reread_subscription();
+		}
+
+		delayUntil = TimestampTzPlusMilliseconds(finish_ts, MySubscription->minapplydelay);
+		diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
+
+		/*
+		 * Exit without arming the latch if it's already past time to apply
+		 * this transaction.
+		 */
+		if (diffms <= 0)
+			break;
+
+		elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = %d ms, remaining wait time: %ld ms",
+			 xid, MySubscription->minapplydelay, diffms);
+
+		/*
+		 * Call send_feedback() to prevent the publisher from timing out
+		 * during the delay, provided wal_receiver_status_interval is set to
+		 * a positive value.
+		 */
+		if (wal_receiver_status_interval > 0 &&
+			diffms > wal_receiver_status_interval * 1000L)
+		{
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  wal_receiver_status_interval * 1000L,
+					  WAIT_EVENT_RECOVERY_APPLY_DELAY);
+			send_feedback(last_received, true, false, true);
+		}
+		else
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  diffms,
+					  WAIT_EVENT_RECOVERY_APPLY_DELAY);
+	}
+}
+
 /*
  * Handle BEGIN message.
  */
@@ -1013,6 +1128,9 @@ apply_handle_begin(StringInfo s)
 	logicalrep_read_begin(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
 
+	/* Should we delay the current transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.committime);
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	maybe_start_skipping_changes(begin_data.final_lsn);
@@ -1070,6 +1188,9 @@ apply_handle_begin_prepare(StringInfo s)
 	logicalrep_read_begin_prepare(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.prepare_lsn);
 
+	/* Should we delay the current prepared transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.prepare_time);
+
 	remote_final_lsn = begin_data.prepare_lsn;
 
 	maybe_start_skipping_changes(begin_data.prepare_lsn);
@@ -1317,7 +1438,8 @@ apply_handle_stream_prepare(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset,
-								   prepare_data.xid, prepare_data.prepare_lsn);
+								   prepare_data.xid, prepare_data.prepare_lsn,
+								   prepare_data.prepare_time);
 
 			/* Mark the transaction as prepared. */
 			apply_handle_prepare_internal(&prepare_data);
@@ -2011,10 +2133,13 @@ ensure_last_message(FileSet *stream_fileset, TransactionId xid, int fileno,
 
 /*
  * Common spoolfile processing.
+ *
+ * The commit/prepare time (finish_ts) is required for time-delayed logical
+ * replication.
  */
 void
 apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-					   XLogRecPtr lsn)
+					   XLogRecPtr lsn, TimestampTz finish_ts)
 {
 	StringInfoData s2;
 	int			nchanges;
@@ -2025,6 +2150,10 @@ apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
 	int			fileno;
 	off_t		offset;
 
+	/* Should we delay the current transaction? */
+	if (finish_ts)
+		maybe_apply_delay(xid, finish_ts);
+
 	if (!am_parallel_apply_worker())
 		maybe_start_skipping_changes(lsn);
 
@@ -2174,7 +2303,7 @@ apply_handle_stream_commit(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset, xid,
-								   commit_data.commit_lsn);
+								   commit_data.commit_lsn, commit_data.committime);
 
 			apply_handle_commit_internal(&commit_data);
 
@@ -3447,7 +3576,7 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
  * Apply main loop.
  */
 static void
-LogicalRepApplyLoop(XLogRecPtr last_received)
+LogicalRepApplyLoop(void)
 {
 	TimestampTz last_recv_timestamp = GetCurrentTimestamp();
 	bool		ping_sent = false;
@@ -3568,7 +3697,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						if (last_received < end_lsn)
 							last_received = end_lsn;
 
-						send_feedback(last_received, reply_requested, false);
+						send_feedback(last_received, reply_requested, false, false);
 						UpdateWorkerStats(last_received, timestamp, true);
 					}
 					/* other message types are purposefully ignored */
@@ -3581,7 +3710,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		}
 
 		/* confirm all writes so far */
-		send_feedback(last_received, false, false);
+		send_feedback(last_received, false, false, false);
 
 		if (!in_remote_transaction && !in_streamed_transaction)
 		{
@@ -3678,7 +3807,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 				}
 			}
 
-			send_feedback(last_received, requestReply, requestReply);
+			send_feedback(last_received, requestReply, requestReply, false);
 
 			/*
 			 * Force reporting to ensure long idle periods don't lead to
@@ -3708,7 +3837,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
  * to send a response to avoid timeouts.
  */
 static void
-send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
+send_feedback(XLogRecPtr recvpos, bool force, bool requestReply, bool has_unprocessed_change)
 {
 	static StringInfo reply_message = NULL;
 	static TimestampTz send_time = 0;
@@ -3738,8 +3867,14 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	/*
 	 * No outstanding transactions to flush, we can report the latest received
 	 * position. This is important for synchronous replication.
+	 *
+	 * If the logical replication subscription has unprocessed changes, do
+	 * not inform the publisher that the latest received LSN is already
+	 * applied and flushed; otherwise, the publisher will make a wrong
+	 * assumption about the logical replication progress. Instead, just send a
+	 * feedback message to avoid a replication timeout during the delay.
 	 */
-	if (!have_pending_txes)
+	if (!have_pending_txes && !has_unprocessed_change)
 		flushpos = writepos = recvpos;
 
 	if (writepos < last_writepos)
@@ -3776,8 +3911,9 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
 
-	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
+	elog(DEBUG2, "sending feedback (force %d, has_unprocessed_change %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
+		 has_unprocessed_change,
 		 LSN_FORMAT_ARGS(recvpos),
 		 LSN_FORMAT_ARGS(writepos),
 		 LSN_FORMAT_ARGS(flushpos));
@@ -4367,11 +4503,11 @@ start_table_sync(XLogRecPtr *origin_startpos, char **myslotname)
  * of system resource error and are not repeatable.
  */
 static void
-start_apply(XLogRecPtr origin_startpos)
+start_apply(void)
 {
 	PG_TRY();
 	{
-		LogicalRepApplyLoop(origin_startpos);
+		LogicalRepApplyLoop();
 	}
 	PG_CATCH();
 	{
@@ -4661,7 +4797,8 @@ ApplyWorkerMain(Datum main_arg)
 	}
 
 	/* Run the main loop. */
-	start_apply(origin_startpos);
+	last_received = origin_startpos;
+	start_apply();
 
 	proc_exit(0);
 }
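
Illustrative sketch, not part of the patch. The wait loop in maybe_apply_delay() above sleeps in slices no longer than wal_receiver_status_interval and sends a feedback message after each full slice, so that the walsender does not time out. The standalone model below shows only the per-iteration decision, with all timestamps reduced to plain milliseconds. The DelayStep type and next_delay_step() are hypothetical names introduced for illustration; latch handling, interrupts, configuration reload and the tablesync check are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Outcome of one pass through the delay loop (hypothetical type). */
typedef struct DelayStep
{
	bool		done;			/* delay has elapsed, apply now */
	int64_t		wait_ms;		/* how long to sleep in this iteration */
	bool		send_feedback;	/* send a feedback message after sleeping */
} DelayStep;

/*
 * Decide what one loop iteration does. finish_ts_ms and now_ms are in
 * milliseconds; status_interval_s is in seconds, matching the unit of
 * wal_receiver_status_interval.
 */
DelayStep
next_delay_step(int64_t finish_ts_ms, int32_t min_apply_delay_ms,
				int64_t now_ms, int status_interval_s)
{
	DelayStep	step = {true, 0, false};
	int64_t		slice_ms = (int64_t) status_interval_s * 1000;
	int64_t		remaining_ms;

	assert(finish_ts_ms > 0);

	/* No delay configured, or the target time has already passed. */
	remaining_ms = finish_ts_ms + min_apply_delay_ms - now_ms;
	if (min_apply_delay_ms <= 0 || remaining_ms <= 0)
		return step;

	step.done = false;
	if (status_interval_s > 0 && remaining_ms > slice_ms)
	{
		/* Sleep one status interval, then report to the publisher. */
		step.wait_ms = slice_ms;
		step.send_feedback = true;
	}
	else
		step.wait_ms = remaining_ms;	/* final, shorter sleep */

	return step;
}
```

With a 60 s delay and a 10 s status interval, this model yields a run of 10 s sleeps each followed by a feedback message, then a last sleep covering whatever remains.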
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 527c7651ab..1e87f0124e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4494,6 +4494,7 @@ getSubscriptions(Archive *fout)
 	int			i_subsynccommit;
 	int			i_subpublications;
 	int			i_subbinary;
+	int			i_subminapplydelay;
 	int			i,
 				ntups;
 
@@ -4546,9 +4547,13 @@ getSubscriptions(Archive *fout)
 						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	if (fout->remoteVersion >= 160000)
-		appendPQExpBufferStr(query, " s.suborigin\n");
+		appendPQExpBufferStr(query,
+							 " s.suborigin,\n"
+							 " s.subminapplydelay\n");
 	else
-		appendPQExpBuffer(query, " '%s' AS suborigin\n", LOGICALREP_ORIGIN_ANY);
+		appendPQExpBuffer(query, " '%s' AS suborigin,\n"
+						  " 0 AS subminapplydelay\n",
+						  LOGICALREP_ORIGIN_ANY);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4576,6 +4581,7 @@ getSubscriptions(Archive *fout)
 	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 	i_subdisableonerr = PQfnumber(res, "subdisableonerr");
 	i_suborigin = PQfnumber(res, "suborigin");
+	i_subminapplydelay = PQfnumber(res, "subminapplydelay");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4606,6 +4612,8 @@ getSubscriptions(Archive *fout)
 		subinfo[i].subdisableonerr =
 			pg_strdup(PQgetvalue(res, i, i_subdisableonerr));
 		subinfo[i].suborigin = pg_strdup(PQgetvalue(res, i, i_suborigin));
+		subinfo[i].subminapplydelay =
+			atoi(PQgetvalue(res, i, i_subminapplydelay));
 
 		/* Decide whether we want to dump it */
 		selectDumpableObject(&(subinfo[i].dobj), fout);
@@ -4687,6 +4695,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
+	if (subinfo->subminapplydelay > 0)
+		appendPQExpBuffer(query, ", min_apply_delay = '%d ms'", subinfo->subminapplydelay);
+
 	appendPQExpBufferStr(query, ");\n");
 
 	if (subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION)
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index e7cbd8d7ed..b8831c3ed3 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -661,6 +661,7 @@ typedef struct _SubscriptionInfo
 	char	   *subdisableonerr;
 	char	   *suborigin;
 	char	   *subsynccommit;
+	int			subminapplydelay;
 	char	   *subpublications;
 } SubscriptionInfo;
 
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index c8a0bb7b3a..81d4607a1c 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6472,7 +6472,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false, false, false, false, false};
+	false, false, false, false, false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6527,10 +6527,13 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Two-phase commit"),
 							  gettext_noop("Disable on error"));
 
+		/* Origin and min_apply_delay are only supported in v16 and higher */
 		if (pset.sversion >= 160000)
 			appendPQExpBuffer(&buf,
-							  ", suborigin AS \"%s\"\n",
-							  gettext_noop("Origin"));
+							  ", suborigin AS \"%s\"\n"
+							  ", subminapplydelay AS \"%s\"\n",
+							  gettext_noop("Origin"),
+							  gettext_noop("Min apply delay"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5e1882eaea..e8b9a43a47 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1925,7 +1925,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("(", "PUBLICATION");
 	/* ALTER SUBSCRIPTION <name> SET ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SET", "("))
-		COMPLETE_WITH("binary", "disable_on_error", "origin", "slot_name",
+		COMPLETE_WITH("binary", "disable_on_error", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit");
 	/* ALTER SUBSCRIPTION <name> SKIP ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SKIP", "("))
@@ -3268,7 +3268,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
-					  "disable_on_error", "enabled", "origin", "slot_name",
+					  "disable_on_error", "enabled", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index b0f2a1705d..d1cfefc6d6 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -74,6 +74,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	Oid			subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */
 
+	int32		subminapplydelay;	/* Replication apply delay (ms) */
+
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
@@ -122,6 +124,7 @@ typedef struct Subscription
 								 * skipped */
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
+	int32		minapplydelay;	/* Replication apply delay (ms) */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index dc87a4edd1..3dc09d1a4c 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -255,7 +255,7 @@ extern void stream_stop_internal(TransactionId xid);
 
 /* Common streaming function to apply all the spooled messages */
 extern void apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-								   XLogRecPtr lsn);
+								   XLogRecPtr lsn, TimestampTz finish_ts);
 
 extern void apply_dispatch(StringInfo s);
 
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 3f99b14394..cf8e727ee9 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -114,18 +114,18 @@ CREATE SUBSCRIPTION regress_testsub4 CONNECTION 'dbname=regress_doesnotexist' PU
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub4 SET (origin = any);
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub3;
@@ -143,10 +143,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -163,10 +163,10 @@ ERROR:  unrecognized subscription parameter: "create_slot"
 -- ok
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/12345');
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/12345
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/12345
 (1 row)
 
 -- ok - with lsn = NONE
@@ -175,10 +175,10 @@ ALTER SUBSCRIPTION regress_testsub SKIP (lsn = NONE);
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/0');
 ERROR:  invalid WAL location (LSN): 0/0
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/0
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 BEGIN;
@@ -210,10 +210,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                                               List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
----------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | local              | dbname=regress_doesnotexist2 | 0/0
+                                                                                                        List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | local              | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 -- rename back to keep the rest simple
@@ -247,19 +247,19 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -271,27 +271,27 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication already exists
@@ -306,10 +306,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                                                 List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication used more than once
@@ -324,10 +324,10 @@ ERROR:  publication "testpub3" is not in subscription "regress_testsub"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -363,10 +363,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 --fail - alter of two_phase option not supported.
@@ -375,10 +375,10 @@ ERROR:  unrecognized subscription parameter: "two_phase"
 -- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -388,10 +388,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -404,20 +404,57 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+ERROR:  invalid value for parameter "min_apply_delay": "foo"
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+ERROR:  -1 ms is outside the valid range for parameter "min_apply_delay" (0 .. 2147483647)
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+ERROR:  min_apply_delay > 0 and streaming = parallel are mutually exclusive options
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+WARNING:  subscription was created, but is not connected
+HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |             123 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |        86400000 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+ERROR:  cannot set parallel streaming mode for subscription with min_apply_delay
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ERROR:  cannot set min_apply_delay for subscription in parallel streaming mode
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 7281f5fee2..7317b140f5 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -286,6 +286,30 @@ ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+\dRs+
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 91aa068c95..f94819672b 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -515,6 +515,36 @@ $node_publisher->poll_query_until('postgres',
   or die
   "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
 
+# Test time-delayed logical replication
+#
+# If the subscription sets the min_apply_delay parameter, the logical
+# replication worker delays applying each transaction by at least
+# min_apply_delay milliseconds. We measure the time between inserting tuples
+# on the publisher and the changes being replicated on the subscriber.
+my $delay = 3;
+
+# Set min_apply_delay parameter to 3 seconds
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+
+# Insert new rows on the publisher and check that they reach the subscriber
+# no sooner than the delay configured above. Record the current timestamp
+# before the insertion to serve as the comparison base. Even on slow
+# machines, this keeps the comparison between the insertion time on the
+# publisher and the replay time on the subscriber predictable.
+my $publisher_insert_time = time();
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_ins VALUES (generate_series(1101, 1120))");
+
+# The publisher waits for the replication to complete
+$node_publisher->wait_for_catchup('tap_sub_renamed');
+
+# This test is successful if and only if the LSN has been applied with at least
+# the configured apply delay.
+ok( time() - $publisher_insert_time >= $delay,
+	"subscriber applies WAL only after replication delay for non-streaming transaction"
+);
+
 # check all the cleanup
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
 
-- 
2.27.0
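
As a rough illustration of what the TAP test above checks: the wait before applying a transaction can be modelled as the configured delay minus the time that has already elapsed since the commit on the publisher, with no wait at all once that time exceeds the delay. This is a minimal sketch assuming millisecond timestamps. remaining_apply_delay_ms() and its parameters are hypothetical names for illustration, not functions from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): how many milliseconds the
 * apply worker would still have to wait before applying a transaction.
 *
 * commit_ts_ms       - commit timestamp written on the publisher
 * now_ms             - current time on the subscriber
 * min_apply_delay_ms - configured delay; 0 means "no delay"
 */
static int64_t
remaining_apply_delay_ms(int64_t commit_ts_ms, int64_t now_ms,
						 int32_t min_apply_delay_ms)
{
	int64_t		elapsed;
	int64_t		remaining;

	assert(min_apply_delay_ms >= 0);

	/* Time already spent in decoding and transfer counts toward the delay */
	elapsed = now_ms - commit_ts_ms;
	remaining = (int64_t) min_apply_delay_ms - elapsed;

	/* If the overhead already exceeds the delay, apply immediately */
	return remaining > 0 ? remaining : 0;
}
```

With a 3000 ms delay, a transaction committed 500 ms ago would wait another 2500 ms, while one committed 4000 ms ago would be applied without waiting.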

v2-0002-Extend-START_REPLICATION-command-to-accept-walsen.patch (application/octet-stream)
From bd040e383902430d1bd7a9d9e8b3553c377c90c6 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Tue, 7 Feb 2023 05:38:20 +0000
Subject: [PATCH v2 2/2] Extend START_REPLICATION command to accept walsender
 options

This commit extends START_REPLICATION to accept walsender options. Currently,
only one option, exit_before_confirming, is accepted.

For physical replication, the grammar of START_REPLICATION is extended to accept
options. Note that in normal physical replication the added option is never
used.

For logical replication, the option list for the logical decoding plugin is reused
for storing walsender options. When the min_apply_delay parameter is set for a
subscription, the associated apply worker sends a START_REPLICATION command
with exit_before_confirming = true to the publisher node.

This option allows primary servers to shut down even if there is pending WAL to
be sent or sent WAL has not been flushed on the secondary. This may be useful to
shut down the primary even when the walreceiver/worker is stuck.

Author: Hayato Kuroda
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 doc/src/sgml/protocol.sgml                    | 21 ++++-
 .../libpqwalreceiver/libpqwalreceiver.c       |  4 +
 src/backend/replication/logical/worker.c      | 13 ++-
 src/backend/replication/repl_gram.y           |  8 +-
 src/backend/replication/walsender.c           | 87 ++++++++++++++++++-
 src/include/replication/walreceiver.h         |  1 +
 src/test/subscription/t/001_rep_changes.pl    | 10 ++-
 src/tools/pgindent/typedefs.list              |  1 +
 8 files changed, 138 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 93fc7167d4..9c84d57cfb 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2192,7 +2192,7 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
     </varlistentry>
 
     <varlistentry id="protocol-replication-start-replication">
-     <term><literal>START_REPLICATION</literal> [ <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> ] [ <literal>PHYSICAL</literal> ] <replaceable class="parameter">XXX/XXX</replaceable> [ <literal>TIMELINE</literal> <replaceable class="parameter">tli</replaceable> ]
+     <term><literal>START_REPLICATION</literal> [ <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> ] [ <literal>PHYSICAL</literal> ] <replaceable class="parameter">XXX/XXX</replaceable> [ <literal>TIMELINE</literal> <replaceable class="parameter">tli</replaceable> ] [ ( <replaceable>option_name</replaceable> [ <replaceable>option_value</replaceable> ] [, ...] ) ]
       <indexterm><primary>START_REPLICATION</primary></indexterm>
      </term>
      <listitem>
@@ -2496,6 +2496,25 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
         </listitem>
        </varlistentry>
       </variablelist>
+
+      <para>
+       If further options are given, they control the behavior of the
+       walsender in more detail. Currently, the following option is accepted:
+      </para>
+
+      <variablelist>
+       <varlistentry>
+        <term>exit_before_confirming</term>
+        <listitem>
+         <para>
+          If set to true, the walsender will exit before confirming the remote
+          flush of WALs at shutdown. This can be useful when the network lag
+          between nodes is large and it takes time to shut down the server.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+
      </listitem>
     </varlistentry>
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 560ec974fa..8bf8e03063 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -443,6 +443,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", binary 'true'");
 
+		if (options->proto.logical.exit_before_confirming &&
+			PQserverVersion(conn->streamConn) >= 160000)
+			appendStringInfoString(&cmd, ", exit_before_confirming 'true'");
+
 		appendStringInfoChar(&cmd, ')');
 	}
 	else
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c574531040..d768bafd3e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4034,7 +4034,9 @@ maybe_reread_subscription(void)
 		newsub->stream != MySubscription->stream ||
 		strcmp(newsub->origin, MySubscription->origin) != 0 ||
 		newsub->owner != MySubscription->owner ||
-		!equal(newsub->publications, MySubscription->publications))
+		!equal(newsub->publications, MySubscription->publications) ||
+		(newsub->minapplydelay > 0 && MySubscription->minapplydelay == 0) ||
+		(newsub->minapplydelay == 0 && MySubscription->minapplydelay > 0))
 	{
 		if (am_parallel_apply_worker())
 			ereport(LOG,
@@ -4756,6 +4758,15 @@ ApplyWorkerMain(Datum main_arg)
 
 	if (!am_tablesync_worker())
 	{
+		/*
+		 * time-delayed logical replication does not support tablesync
+		 * workers, so only the leader apply worker can request walsenders to
+		 * exit before confirming remote flush.
+		 */
+		if (server_version >= 160000)
+			options.proto.logical.exit_before_confirming =
+				MySubscription->minapplydelay > 0;
+
 		/*
 		 * Even when the two_phase mode is requested by the user, it remains
 		 * as the tri-state PENDING until all tablesyncs have reached READY
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..1705d52a58 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -91,6 +91,7 @@ Node *replication_parse_result;
 %type <boolval>	opt_temporary
 %type <list>	create_slot_options create_slot_legacy_opt_list
 %type <defelt>	create_slot_legacy_opt
+%type <list>	walsender_options
 
 %%
 
@@ -261,7 +262,7 @@ drop_replication_slot:
  * START_REPLICATION [SLOT slot] [PHYSICAL] %X/%X [TIMELINE %d]
  */
 start_replication:
-			K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline
+			K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline walsender_options
 				{
 					StartReplicationCmd *cmd;
 
@@ -270,6 +271,7 @@ start_replication:
 					cmd->slotname = $2;
 					cmd->startpoint = $4;
 					cmd->timeline = $5;
+					cmd->options = $6;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -336,6 +338,10 @@ opt_timeline:
 				| /* EMPTY */			{ $$ = 0; }
 			;
 
+walsender_options:
+			'(' generic_option_list ')'			{ $$ = $2; }
+			| /* EMPTY */					{ $$ = NIL; }
+		;
 
 plugin_options:
 			'(' plugin_opt_list ')'			{ $$ = $2; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4ed3747e3f..bcc7c38080 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -219,6 +219,22 @@ typedef struct
 
 static LagTracker *lag_tracker;
 
+/*
+ * If set to true, the walsender exits without confirming the remote flush of
+ * WAL and without checking whether the send buffer is empty.
+ */
+static bool exit_before_confirming = false;
+
+/*
+ * Options for controlling the behavior of the walsender. Options can be
+ * specified in the START_REPLICATION command. Currently only one
+ * option is allowed.
+ */
+typedef struct
+{
+	bool		exit_before_confirming;
+} WalSndData;
+
 /* Signal handlers */
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
@@ -260,6 +276,7 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
 							  TimeLineID *tli_p);
 
+static void ConsumeWalsenderOptions(List *options, WalSndData *data);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -672,6 +689,7 @@ StartReplication(StartReplicationCmd *cmd)
 	StringInfoData buf;
 	XLogRecPtr	FlushPtr;
 	TimeLineID	FlushTLI;
+	WalSndData	data = {0};
 
 	/* create xlogreader for physical replication */
 	xlogreader =
@@ -710,6 +728,12 @@ StartReplication(StartReplicationCmd *cmd)
 		 */
 	}
 
+	/* Check given options and set flags accordingly */
+	ConsumeWalsenderOptions(cmd->options, &data);
+
+	if (data.exit_before_confirming)
+		exit_before_confirming = true;
+
 	/*
 	 * Select the timeline. If it was given explicitly by the client, use
 	 * that. Otherwise use the timeline of the last replayed record.
@@ -1245,6 +1269,7 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 {
 	StringInfoData buf;
 	QueryCompletion qc;
+	WalSndData	data = {0};
 
 	/* make sure that our requirements are still fulfilled */
 	CheckLogicalDecodingRequirements();
@@ -1272,6 +1297,12 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 		got_STOPPING = true;
 	}
 
+	/* Check given options and set flags accordingly */
+	ConsumeWalsenderOptions(cmd->options, &data);
+
+	if (data.exit_before_confirming)
+		exit_before_confirming = true;
+
 	/*
 	 * Create our decoding context, making it start at the previously ack'ed
 	 * position.
@@ -1450,6 +1481,9 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		if (exit_before_confirming)
+			WalSndDone(XLogSendLogical);
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
@@ -3118,15 +3152,16 @@ WalSndDone(WalSndSendDataCallback send_data)
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	if (WalSndCaughtUp &&
+		(exit_before_confirming ||
+		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
 	{
 		QueryCompletion qc;
 
 		/* Inform the standby that XLOG streaming is done */
 		SetQueryCompletion(&qc, CMDTAG_COPY, 0);
 		EndCommand(&qc, DestRemote, false);
-		pq_flush();
+		pq_flush_if_writable();
 
 		proc_exit(0);
 	}
@@ -3849,3 +3884,49 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+/*
+ * Read all entries of the list and consume the walsender options.
+ *
+ * In logical replication mode, the given list may contain both walsender and
+ * output plugin options, and a leftover walsender option would lead to an
+ * "unrecognized pgoutput option" ERROR. Therefore, entries for walsender
+ * options are removed from the list when found.
+ */
+static void
+ConsumeWalsenderOptions(List *options, WalSndData *data)
+{
+	ListCell   *lc;
+	bool		exit_before_confirming_given = false;
+
+	foreach(lc, options)
+	{
+		DefElem    *defel = (DefElem *) lfirst(lc);
+
+		Assert(defel->arg == NULL || IsA(defel->arg, String));
+
+		/* Check each param, whether or not we recognize it */
+		if (strcmp(defel->defname, "exit_before_confirming") == 0)
+		{
+			if (exit_before_confirming_given)
+				ereport(ERROR,
+						errcode(ERRCODE_SYNTAX_ERROR),
+						errmsg("conflicting or redundant options"));
+			exit_before_confirming_given = true;
+
+			data->exit_before_confirming = defGetBoolean(defel);
+
+			/*
+			 * Eliminate the current element, because the list may be passed on
+			 * to the pgoutput module, which would raise an ERROR for the
+			 * unrecognized option.
+			 */
+			options = foreach_delete_current(options, lc);
+		}
+
+		/*
+		 * ERROR is not raised here even if the given parameter is not known,
+		 * because it may be written for the output plugin.
+		 */
+	}
+}
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index decffe352d..f801fb3e0d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -187,6 +187,7 @@ typedef struct
 									 * prepare time */
 			char	   *origin; /* Only publish data originating from the
 								 * specified origin */
+			bool		exit_before_confirming;
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index f94819672b..d7a6fd0e38 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -523,9 +523,17 @@ $node_publisher->poll_query_until('postgres',
 # changes are replicated on subscriber.
 my $delay = 3;
 
-# Set min_apply_delay parameter to 3 seconds
+# check restart on changing min_apply_delay to 3 seconds
+$oldpid = $node_publisher->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+);
 $node_subscriber->safe_psql('postgres',
 	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+$node_publisher->poll_query_until('postgres',
+	"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+  )
+  or die
+  "Timed out while waiting for apply to restart after changing min_apply_delay to non-zero value";
 
 # Make new content on publisher and check its presence in subscriber depending
 # on the delay applied above. Before doing the insertion, get the
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 07fbb7ccf6..3b7f8eb063 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2968,6 +2968,7 @@ WalReceiverConn
 WalReceiverFunctionsType
 WalSnd
 WalSndCtlData
+WalSndData
 WalSndSendDataCallback
 WalSndState
 WalTimeSample
-- 
2.27.0
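
For readers skimming the patch, the shutdown condition that v2-0002 changes in WalSndDone() can be modelled outside PostgreSQL roughly as follows. This is only an illustrative sketch, not code from the patch: the ShutdownState struct and can_exit_now() are hypothetical names, and the fields stand in for WalSndCaughtUp, sentPtr, replicatedPtr, pq_is_send_pending() and the new exit_before_confirming flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the state the walsender consults at shutdown. */
typedef struct
{
	bool		caught_up;		/* all WAL has been sent */
	uint64_t	sent_ptr;		/* last position sent to the receiver */
	uint64_t	replicated_ptr; /* last position confirmed by the receiver */
	bool		send_pending;	/* output buffer still holds unsent bytes */
	bool		exit_before_confirming; /* option requested by the client */
} ShutdownState;

/*
 * Returns true if the walsender may exit now.
 *
 * Without the option, exit requires that the receiver has confirmed
 * everything that was sent and that nothing is left in the send buffer.
 * With the option, being caught up is sufficient.
 */
static bool
can_exit_now(const ShutdownState *s)
{
	if (!s->caught_up)
		return false;

	if (s->exit_before_confirming)
		return true;

	return s->sent_ptr == s->replicated_ptr && !s->send_pending;
}
```

In this model, a walsender whose subscriber is still holding back changes (replicated_ptr behind sent_ptr) can exit only when exit_before_confirming is set, which is the case the time-delayed subscription needs.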

#42Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#41)
2 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

I noticed that the previous patches were rejected by cfbot, even though they
passed in my environment.

While analyzing further, I found another bug: a missing initialization.
PSA a new version that passes the automated tests on my GitHub repository.
Sorry for the noise.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v3-0001-Time-delayed-logical-replication-subscriber.patch (application/octet-stream)
From f20c835021e2d2fe157312523f863fc8f6e4b0e3 Mon Sep 17 00:00:00 2001
From: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Date: Tue, 7 Feb 2023 13:05:34 +0000
Subject: [PATCH v3 1/2] Time-delayed logical replication subscriber

Similar to physical replication, a time-delayed copy of the data for
logical replication is useful for some scenarios (particularly to fix
errors that might cause data loss).

This patch implements a new subscription parameter called 'min_apply_delay'.

If a subscription sets the min_apply_delay parameter, the logical
replication worker delays applying each transaction by min_apply_delay
milliseconds.

The delay is calculated between the WAL time stamp and the current time
on the subscriber.

The delay occurs before we start to apply the transaction on the
subscriber. The main reason is to avoid keeping a transaction open for
a long time. Regular and prepared transactions are covered. Streamed
transactions are also covered.

The combination of parallel streaming mode and min_apply_delay is not
allowed. This is because in parallel streaming mode, we start applying
the transaction stream as soon as the first change arrives without
knowing the transaction's prepare/commit time. This means we cannot
calculate the underlying network/decoding lag between publisher and
subscriber, and so always waiting for the full 'min_apply_delay' period
might include unnecessary delay.

The other possibility was to apply the delay at the end of the parallel
apply transaction but that would cause issues related to resource
bloat and locks being held for a long time.

Note that this feature doesn't interact with the skip transaction feature.
The skip transaction feature applies to one transaction with a specific LSN.
So, even if the skipped transaction and non-skipped transaction come
consecutively in a very short time, regardless of the order of which comes
first, the time-delayed feature gets balanced by delayed application
for other transactions before and after the skipped transaction.

Author: Euler Taveira, Takamichi Osumi, Kuroda Hayato
Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Shveta Malik,
             Kyotaro Horiguchi, Shi Yu, Wang Wei, Dilip Kumar, Melih Mutlu
Discussion: https://postgr.es/m/CAB-JLwYOYwL=XTyAXKiH5CtM_Vm8KjKh7aaitCKvmCh4rzr5pQ@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                    |   9 +
 doc/src/sgml/config.sgml                      |  12 ++
 doc/src/sgml/glossary.sgml                    |  14 ++
 doc/src/sgml/logical-replication.sgml         |   6 +
 doc/src/sgml/ref/alter_subscription.sgml      |   5 +-
 doc/src/sgml/ref/create_subscription.sgml     |  49 ++++-
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/catalog/system_views.sql          |   7 +-
 src/backend/commands/subscriptioncmds.c       | 122 +++++++++++-
 .../replication/logical/applyparallelworker.c |   3 +-
 src/backend/replication/logical/worker.c      | 165 ++++++++++++++--
 src/bin/pg_dump/pg_dump.c                     |  15 +-
 src/bin/pg_dump/pg_dump.h                     |   1 +
 src/bin/psql/describe.c                       |   9 +-
 src/bin/psql/tab-complete.c                   |   4 +-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/replication/worker_internal.h     |   2 +-
 src/test/regress/expected/subscription.out    | 181 +++++++++++-------
 src/test/regress/sql/subscription.sql         |  24 +++
 src/test/subscription/t/001_rep_changes.pl    |  30 +++
 20 files changed, 558 insertions(+), 104 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048054..5dc5ca1133 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7873,6 +7873,15 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subminapplydelay</structfield> <type>int4</type>
+      </para>
+      <para>
+       The minimum delay, in milliseconds, for applying changes
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subname</structfield> <type>name</type>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d190be1925..626a8b5bd0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4787,6 +4787,18 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
        the <filename>postgresql.conf</filename> file or on the server
        command line.
       </para>
+      <para>
+       For time-delayed logical replication, the apply worker sends a feedback
+       message to the publisher every
+       <varname>wal_receiver_status_interval</varname> milliseconds. Make sure
+       to set <varname>wal_receiver_status_interval</varname> less than the
+       <varname>wal_sender_timeout</varname> on the publisher, otherwise, the
+       <literal>walsender</literal> will repeatedly terminate due to timeout
+       errors. Note that if <varname>wal_receiver_status_interval</varname> is
+       set to zero, the apply worker sends no feedback messages during the
+       <literal>min_apply_delay</literal> period. Refer to
+       <xref linkend="sql-createsubscription"/> for more information.
+      </para>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 7c01a541fe..6ed6fa5853 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1729,6 +1729,20 @@
    </glossdef>
   </glossentry>
 
+  <glossentry id="glossary-time-delayed-replication">
+   <glossterm>Time-delayed replication</glossterm>
+   <glossdef>
+     <para>
+      A replication setup that applies a time-delayed copy of the data.
+    </para>
+    <para>
+     For more information, see
+     <xref linkend="guc-recovery-min-apply-delay"/> for physical replication
+     and <xref linkend="sql-createsubscription"/> for logical replication.
+    </para>
+   </glossdef>
+  </glossentry>
+
   <glossentry id="glossary-toast">
    <glossterm>TOAST</glossterm>
    <glossdef>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 1bd5660c87..6bd5f61e2b 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -247,6 +247,12 @@
    target table.
   </para>
 
+  <para>
+   A subscription can delay the application of changes by specifying the
+   <literal>min_apply_delay</literal> subscription parameter. See
+   <xref linkend="sql-createsubscription"/> for details.
+  </para>
+
   <sect2 id="logical-replication-subscription-slot">
    <title>Replication Slot Management</title>
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 964fcbb8ff..8b7eb28e54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -213,8 +213,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
       <literal>binary</literal>, <literal>streaming</literal>,
-      <literal>disable_on_error</literal>, and
-      <literal>origin</literal>.
+      <literal>disable_on_error</literal>,
+      <literal>origin</literal>, and
+      <literal>min_apply_delay</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 51c45f17c7..1b4b8390af 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -349,7 +349,49 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
-      </variablelist></para>
+
+       <varlistentry>
+        <term><literal>min_apply_delay</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          By default, the subscriber applies changes as soon as possible. This
+          parameter allows the user to delay the application of changes by a
+          given time period. If the value is specified without units, it is
+          taken as milliseconds. The default is zero (no delay). See
+          <xref linkend="config-setting-names-values"/> for details on the
+          available valid time units.
+         </para>
+         <para>
+          Any delay becomes effective only after all initial table
+          synchronization has finished and occurs before each transaction starts
+          to get applied on the subscriber. The delay is calculated as the
+          difference between the WAL timestamp as written on the publisher and
+          the current time on the subscriber. Any overhead of time spent in
+          logical decoding and in transferring the transaction may reduce the
+          actual wait time. It is also possible that the overhead already
+          exceeds the requested <literal>min_apply_delay</literal> value, in
+          which case no delay is applied. If the system clocks on publisher and
+          subscriber are not synchronized, changes may be applied earlier
+          than expected, but this is not a major issue because this
+          parameter is typically much larger than the time deviations between
+          servers. Note that if this parameter is set to a long delay, the
+          replication will stop if the replication slot falls behind the current
+          LSN by more than
+          <link linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</literal></link>.
+         </para>
+         <warning>
+           <para>
+            Delaying the replication means there is a much longer time between
+            making a change on the publisher, and that change being committed
+            on the subscriber. This can impact the performance of synchronous
+            replication. See <xref linkend="guc-synchronous-commit"/>
+            parameter.
+           </para>
+         </warning>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
 
     </listitem>
    </varlistentry>
@@ -420,6 +462,11 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
    published with different column lists are not supported.
   </para>
 
+  <para>
+   A non-zero <literal>min_apply_delay</literal> parameter is not allowed when
+   streaming in parallel mode.
+  </para>
+
   <para>
    We allow non-existent publications to be specified so that users can add
    those later. This means
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index a56ae311c3..e19e5cbca2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->skiplsn = subform->subskiplsn;
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
+	sub->minapplydelay = subform->subminapplydelay;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..317c2010cb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1299,9 +1299,10 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled,
-              subbinary, substream, subtwophasestate, subdisableonerr,
-              subslotname, subsynccommit, subpublications, suborigin)
+GRANT SELECT (oid, subdbid, subskiplsn, subminapplydelay, subname, subowner,
+              subenabled, subbinary, substream, subtwophasestate,
+              subdisableonerr, subslotname, subsynccommit, subpublications,
+              suborigin)
     ON pg_subscription TO public;
 
 CREATE VIEW pg_stat_subscription_stats AS
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 464db6d247..82e16fd0f9 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -66,6 +66,7 @@
 #define SUBOPT_DISABLE_ON_ERR		0x00000400
 #define SUBOPT_LSN					0x00000800
 #define SUBOPT_ORIGIN				0x00001000
+#define SUBOPT_MIN_APPLY_DELAY		0x00002000
 
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
@@ -90,6 +91,7 @@ typedef struct SubOpts
 	bool		disableonerr;
 	char	   *origin;
 	XLogRecPtr	lsn;
+	int32		min_apply_delay;
 } SubOpts;
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
@@ -100,7 +102,7 @@ static void check_publications_origin(WalReceiverConn *wrconn,
 static void check_duplicates_in_publist(List *publist, Datum *datums);
 static List *merge_publications(List *oldpublist, List *newpublist, bool addpub, const char *subname);
 static void ReportSlotConnectionError(List *rstates, Oid subid, char *slotname, char *err);
-
+static int32 defGetMinApplyDelay(DefElem *def);
 
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
@@ -146,6 +148,8 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 		opts->disableonerr = false;
 	if (IsSet(supported_opts, SUBOPT_ORIGIN))
 		opts->origin = pstrdup(LOGICALREP_ORIGIN_ANY);
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY))
+		opts->min_apply_delay = 0;
 
 	/* Parse options */
 	foreach(lc, stmt_options)
@@ -324,6 +328,15 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 			opts->specified_opts |= SUBOPT_LSN;
 			opts->lsn = lsn;
 		}
+		else if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+				 strcmp(defel->defname, "min_apply_delay") == 0)
+		{
+			if (IsSet(opts->specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				errorConflictingDefElem(defel, pstate);
+
+			opts->specified_opts |= SUBOPT_MIN_APPLY_DELAY;
+			opts->min_apply_delay = defGetMinApplyDelay(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -404,6 +417,32 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 								"slot_name = NONE", "create_slot = false")));
 		}
 	}
+
+	/*
+	 * The combination of parallel streaming mode and min_apply_delay is not
+	 * allowed. This is because in parallel streaming mode, we start applying
+	 * the transaction stream as soon as the first change arrives without
+	 * knowing the transaction's prepare/commit time. This means we cannot
+	 * calculate the underlying network/decoding lag between publisher and
+	 * subscriber, and so always waiting for the full 'min_apply_delay' period
+	 * might include unnecessary delay.
+	 *
+	 * The other possibility was to apply the delay at the end of the parallel
+	 * apply transaction but that would cause issues related to resource bloat
+	 * and locks being held for a long time.
+	 */
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+		opts->min_apply_delay > 0 &&
+		opts->streaming == LOGICALREP_STREAM_PARALLEL)
+		ereport(ERROR,
+				errcode(ERRCODE_SYNTAX_ERROR),
+
+		/*
+		 * translator: the first %s is a string of the form "parameter > 0"
+		 * and the second one is "option = value".
+		 */
+				errmsg("%s and %s are mutually exclusive options",
+					   "min_apply_delay > 0", "streaming = parallel"));
 }
 
 /*
@@ -560,7 +599,8 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 					  SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 					  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 					  SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT |
-					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN);
+					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN |
+					  SUBOPT_MIN_APPLY_DELAY);
 	parse_subscription_options(pstate, stmt->options, supported_opts, &opts);
 
 	/*
@@ -625,6 +665,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 	values[Anum_pg_subscription_oid - 1] = ObjectIdGetDatum(subid);
 	values[Anum_pg_subscription_subdbid - 1] = ObjectIdGetDatum(MyDatabaseId);
 	values[Anum_pg_subscription_subskiplsn - 1] = LSNGetDatum(InvalidXLogRecPtr);
+	values[Anum_pg_subscription_subminapplydelay - 1] = Int32GetDatum(opts.min_apply_delay);
 	values[Anum_pg_subscription_subname - 1] =
 		DirectFunctionCall1(namein, CStringGetDatum(stmt->subname));
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
@@ -1054,7 +1095,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 				supported_opts = (SUBOPT_SLOT_NAME |
 								  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 								  SUBOPT_STREAMING | SUBOPT_DISABLE_ON_ERR |
-								  SUBOPT_ORIGIN);
+								  SUBOPT_ORIGIN | SUBOPT_MIN_APPLY_DELAY);
 
 				parse_subscription_options(pstate, stmt->options,
 										   supported_opts, &opts);
@@ -1098,6 +1139,19 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.streaming == LOGICALREP_STREAM_PARALLEL &&
+						!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)
+						&& sub->minapplydelay > 0)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set parallel streaming mode for subscription with %s",
+									   "min_apply_delay"));
+
 					values[Anum_pg_subscription_substream - 1] =
 						CharGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -1111,6 +1165,26 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 						= true;
 				}
 
+				if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.min_apply_delay > 0 &&
+						!IsSet(opts.specified_opts, SUBOPT_STREAMING)
+						&& sub->stream == LOGICALREP_STREAM_PARALLEL)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set %s for subscription in parallel streaming mode",
+									   "min_apply_delay"));
+
+					values[Anum_pg_subscription_subminapplydelay - 1] =
+						Int32GetDatum(opts.min_apply_delay);
+					replaces[Anum_pg_subscription_subminapplydelay - 1] = true;
+				}
+
 				if (IsSet(opts.specified_opts, SUBOPT_ORIGIN))
 				{
 					values[Anum_pg_subscription_suborigin - 1] =
@@ -2195,3 +2269,45 @@ defGetStreamingMode(DefElem *def)
 					def->defname)));
 	return LOGICALREP_STREAM_OFF;	/* keep compiler quiet */
 }
+
+/*
+ * Extract the min_apply_delay value from a DefElem. This is very similar to
+ * parse_and_validate_value() for integer values, because min_apply_delay
+ * accepts the same parameter format as recovery_min_apply_delay.
+ */
+static int32
+defGetMinApplyDelay(DefElem *def)
+{
+	char	   *input_string;
+	int			result;
+	const char *hintmsg;
+
+	input_string = defGetString(def);
+
+	/*
+	 * Parse the given string as a parameter with millisecond units.
+	 */
+	if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid value for parameter \"%s\": \"%s\"",
+						"min_apply_delay", input_string),
+				 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+
+	/*
+	 * Check the lower bound of the valid min_apply_delay range, and also the
+	 * upper bound as a safeguard for platforms where int is wider than int32.
+	 * Although parse_int() has confirmed that the result is less than or
+	 * equal to INT_MAX, the value will be stored in an int32 catalog column,
+	 * so it must not exceed PG_INT32_MAX.
+	 */
+	if (result < 0 || result > PG_INT32_MAX)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)",
+						result,
+						"min_apply_delay",
+						0, PG_INT32_MAX)));
+
+	return result;
+}
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index da437e0bc3..32db20fd98 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -704,7 +704,8 @@ pa_process_spooled_messages_if_required(void)
 	{
 		apply_spooled_messages(&MyParallelShared->fileset,
 							   MyParallelShared->xid,
-							   InvalidXLogRecPtr);
+							   InvalidXLogRecPtr,
+							   0);
 		pa_set_fileset_state(MyParallelShared, FS_EMPTY);
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfb2ab6248..c574531040 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -319,6 +319,17 @@ static List *on_commit_wakeup_workers_subids = NIL;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/*
+ * To avoid a walsender timeout during time-delayed logical replication, the
+ * apply worker keeps sending feedback messages during the delay period.
+ * However, the delay is applied before the transaction starts, so no WAL
+ * records are written for the suspended changes during the wait. When the
+ * apply worker sends a feedback message during the delay, it must therefore
+ * not overwrite the flush and apply positions with the last received LSN.
+ * See send_feedback() for details.
+ */
+static XLogRecPtr last_received = InvalidXLogRecPtr;
+
 /* fields valid only when processing streamed transaction */
 static bool in_streamed_transaction = false;
 
@@ -389,7 +400,8 @@ static void stream_write_change(char action, StringInfo s);
 static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
 static void stream_close_file(void);
 
-static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
+static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply,
+						  bool has_unprocessed_change);
 
 static void DisableSubscriptionAndExit(void);
 
@@ -999,6 +1011,109 @@ slot_modify_data(TupleTableSlot *slot, TupleTableSlot *srcslot,
 	ExecStoreVirtualTuple(slot);
 }
 
+/*
+ * When min_apply_delay parameter is set on the subscriber, we wait long enough
+ * to make sure a transaction is applied at least that period behind the
+ * publisher.
+ *
+ * While physical replication applies the delay at commit time, this feature
+ * applies the delay before starting to apply the transaction. This is mainly
+ * because keeping a transaction that has performed write operations open for
+ * a long time causes problems such as bloat and locks being held for a long
+ * time.
+ *
+ * The min_apply_delay parameter will take effect only after all tables are in
+ * READY state.
+ *
+ * xid is the transaction id where we apply the delay.
+ *
+ * finish_ts is the commit/prepare time of both regular (non-streamed) and
+ * streamed transactions. Unlike the regular (non-streamed) cases, the delay
+ * is applied in a STREAM COMMIT/STREAM PREPARE message for streamed
+ * transactions. The STREAM START message does not contain a commit/prepare
+ * time (it will be available when the in-progress transaction finishes).
+ * Hence, it's not appropriate to apply a delay at the STREAM START time.
+ */
+static void
+maybe_apply_delay(TransactionId xid, TimestampTz finish_ts)
+{
+	Assert(finish_ts > 0);
+
+	/* Nothing to do if no delay set */
+	if (!MySubscription->minapplydelay)
+		return;
+
+	/*
+	 * The min_apply_delay parameter is ignored until all tablesync workers
+	 * have reached READY state. This is because if we allowed the delay
+	 * during the catchup phase, then once we reached the limit of tablesync
+	 * workers it would impose a delay for each subsequent worker. That would
+	 * cause initial table synchronization completion to take a long time.
+	 */
+	if (!AllTablesyncsReady())
+		return;
+
+	/* Apply the delay using the latch mechanism */
+	while (true)
+	{
+		TimestampTz delayUntil;
+		long		diffms;
+
+		ResetLatch(MyLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* This might change wal_receiver_status_interval */
+		if (ConfigReloadPending)
+		{
+			ConfigReloadPending = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		/*
+		 * Before calculating the time duration, reload the catalog if needed.
+		 */
+		if (!in_remote_transaction && !in_streamed_transaction)
+		{
+			AcceptInvalidationMessages();
+			maybe_reread_subscription();
+		}
+
+		delayUntil = TimestampTzPlusMilliseconds(finish_ts, MySubscription->minapplydelay);
+		diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
+
+		/*
+		 * Exit without arming the latch if it's already past time to apply
+		 * this transaction.
+		 */
+		if (diffms <= 0)
+			break;
+
+		elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = %d ms, remaining wait time: %ld ms",
+			 xid, MySubscription->minapplydelay, diffms);
+
+		/*
+		 * Call send_feedback() to prevent the publisher from exiting due to
+		 * a timeout during the delay, provided wal_receiver_status_interval
+		 * is enabled (greater than zero).
+		 */
+		if (wal_receiver_status_interval > 0 &&
+			diffms > wal_receiver_status_interval * 1000L)
+		{
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  wal_receiver_status_interval * 1000L,
+					  WAIT_EVENT_RECOVERY_APPLY_DELAY);
+			send_feedback(last_received, true, false, true);
+		}
+		else
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  diffms,
+					  WAIT_EVENT_RECOVERY_APPLY_DELAY);
+	}
+}
+
 /*
  * Handle BEGIN message.
  */
@@ -1013,6 +1128,9 @@ apply_handle_begin(StringInfo s)
 	logicalrep_read_begin(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
 
+	/* Should we delay the current transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.committime);
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	maybe_start_skipping_changes(begin_data.final_lsn);
@@ -1070,6 +1188,9 @@ apply_handle_begin_prepare(StringInfo s)
 	logicalrep_read_begin_prepare(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.prepare_lsn);
 
+	/* Should we delay the current prepared transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.prepare_time);
+
 	remote_final_lsn = begin_data.prepare_lsn;
 
 	maybe_start_skipping_changes(begin_data.prepare_lsn);
@@ -1317,7 +1438,8 @@ apply_handle_stream_prepare(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset,
-								   prepare_data.xid, prepare_data.prepare_lsn);
+								   prepare_data.xid, prepare_data.prepare_lsn,
+								   prepare_data.prepare_time);
 
 			/* Mark the transaction as prepared. */
 			apply_handle_prepare_internal(&prepare_data);
@@ -2011,10 +2133,13 @@ ensure_last_message(FileSet *stream_fileset, TransactionId xid, int fileno,
 
 /*
  * Common spoolfile processing.
+ *
+ * The commit/prepare time (finish_ts) is required for time-delayed logical
+ * replication.
  */
 void
 apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-					   XLogRecPtr lsn)
+					   XLogRecPtr lsn, TimestampTz finish_ts)
 {
 	StringInfoData s2;
 	int			nchanges;
@@ -2025,6 +2150,10 @@ apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
 	int			fileno;
 	off_t		offset;
 
+	/* Should we delay the current transaction? */
+	if (finish_ts)
+		maybe_apply_delay(xid, finish_ts);
+
 	if (!am_parallel_apply_worker())
 		maybe_start_skipping_changes(lsn);
 
@@ -2174,7 +2303,7 @@ apply_handle_stream_commit(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset, xid,
-								   commit_data.commit_lsn);
+								   commit_data.commit_lsn, commit_data.committime);
 
 			apply_handle_commit_internal(&commit_data);
 
@@ -3447,7 +3576,7 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
  * Apply main loop.
  */
 static void
-LogicalRepApplyLoop(XLogRecPtr last_received)
+LogicalRepApplyLoop(void)
 {
 	TimestampTz last_recv_timestamp = GetCurrentTimestamp();
 	bool		ping_sent = false;
@@ -3568,7 +3697,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						if (last_received < end_lsn)
 							last_received = end_lsn;
 
-						send_feedback(last_received, reply_requested, false);
+						send_feedback(last_received, reply_requested, false, false);
 						UpdateWorkerStats(last_received, timestamp, true);
 					}
 					/* other message types are purposefully ignored */
@@ -3581,7 +3710,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		}
 
 		/* confirm all writes so far */
-		send_feedback(last_received, false, false);
+		send_feedback(last_received, false, false, false);
 
 		if (!in_remote_transaction && !in_streamed_transaction)
 		{
@@ -3678,7 +3807,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 				}
 			}
 
-			send_feedback(last_received, requestReply, requestReply);
+			send_feedback(last_received, requestReply, requestReply, false);
 
 			/*
 			 * Force reporting to ensure long idle periods don't lead to
@@ -3708,7 +3837,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
  * to send a response to avoid timeouts.
  */
 static void
-send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
+send_feedback(XLogRecPtr recvpos, bool force, bool requestReply, bool has_unprocessed_change)
 {
 	static StringInfo reply_message = NULL;
 	static TimestampTz send_time = 0;
@@ -3738,8 +3867,14 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	/*
 	 * No outstanding transactions to flush, we can report the latest received
 	 * position. This is important for synchronous replication.
+	 *
+	 * If the subscription has unprocessed changes, do not inform the
+	 * publisher that the latest received LSN has already been applied and
+	 * flushed; otherwise, the publisher would make a wrong assumption about
+	 * logical replication progress. Instead, just send a feedback message to
+	 * avoid a replication timeout during the delay.
 	 */
-	if (!have_pending_txes)
+	if (!have_pending_txes && !has_unprocessed_change)
 		flushpos = writepos = recvpos;
 
 	if (writepos < last_writepos)
@@ -3776,8 +3911,9 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
 
-	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
+	elog(DEBUG2, "sending feedback (force %d, has_unprocessed_change %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
+		 has_unprocessed_change,
 		 LSN_FORMAT_ARGS(recvpos),
 		 LSN_FORMAT_ARGS(writepos),
 		 LSN_FORMAT_ARGS(flushpos));
@@ -4367,11 +4503,11 @@ start_table_sync(XLogRecPtr *origin_startpos, char **myslotname)
  * of system resource error and are not repeatable.
  */
 static void
-start_apply(XLogRecPtr origin_startpos)
+start_apply(void)
 {
 	PG_TRY();
 	{
-		LogicalRepApplyLoop(origin_startpos);
+		LogicalRepApplyLoop();
 	}
 	PG_CATCH();
 	{
@@ -4661,7 +4797,8 @@ ApplyWorkerMain(Datum main_arg)
 	}
 
 	/* Run the main loop. */
-	start_apply(origin_startpos);
+	last_received = origin_startpos;
+	start_apply();
 
 	proc_exit(0);
 }
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 527c7651ab..1e87f0124e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4494,6 +4494,7 @@ getSubscriptions(Archive *fout)
 	int			i_subsynccommit;
 	int			i_subpublications;
 	int			i_subbinary;
+	int			i_subminapplydelay;
 	int			i,
 				ntups;
 
@@ -4546,9 +4547,13 @@ getSubscriptions(Archive *fout)
 						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	if (fout->remoteVersion >= 160000)
-		appendPQExpBufferStr(query, " s.suborigin\n");
+		appendPQExpBufferStr(query,
+							 " s.suborigin,\n"
+							 " s.subminapplydelay\n");
 	else
-		appendPQExpBuffer(query, " '%s' AS suborigin\n", LOGICALREP_ORIGIN_ANY);
+		appendPQExpBuffer(query, " '%s' AS suborigin,\n"
+						  " 0 AS subminapplydelay\n",
+						  LOGICALREP_ORIGIN_ANY);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4576,6 +4581,7 @@ getSubscriptions(Archive *fout)
 	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 	i_subdisableonerr = PQfnumber(res, "subdisableonerr");
 	i_suborigin = PQfnumber(res, "suborigin");
+	i_subminapplydelay = PQfnumber(res, "subminapplydelay");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4606,6 +4612,8 @@ getSubscriptions(Archive *fout)
 		subinfo[i].subdisableonerr =
 			pg_strdup(PQgetvalue(res, i, i_subdisableonerr));
 		subinfo[i].suborigin = pg_strdup(PQgetvalue(res, i, i_suborigin));
+		subinfo[i].subminapplydelay =
+			atoi(PQgetvalue(res, i, i_subminapplydelay));
 
 		/* Decide whether we want to dump it */
 		selectDumpableObject(&(subinfo[i].dobj), fout);
@@ -4687,6 +4695,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
+	if (subinfo->subminapplydelay > 0)
+		appendPQExpBuffer(query, ", min_apply_delay = '%d ms'", subinfo->subminapplydelay);
+
 	appendPQExpBufferStr(query, ");\n");
 
 	if (subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION)
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index e7cbd8d7ed..b8831c3ed3 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -661,6 +661,7 @@ typedef struct _SubscriptionInfo
 	char	   *subdisableonerr;
 	char	   *suborigin;
 	char	   *subsynccommit;
+	int			subminapplydelay;
 	char	   *subpublications;
 } SubscriptionInfo;
 
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index c8a0bb7b3a..81d4607a1c 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6472,7 +6472,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false, false, false, false, false};
+	false, false, false, false, false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6527,10 +6527,13 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Two-phase commit"),
 							  gettext_noop("Disable on error"));
 
+		/* Origin and min_apply_delay are only supported in v16 and higher */
 		if (pset.sversion >= 160000)
 			appendPQExpBuffer(&buf,
-							  ", suborigin AS \"%s\"\n",
-							  gettext_noop("Origin"));
+							  ", suborigin AS \"%s\"\n"
+							  ", subminapplydelay AS \"%s\"\n",
+							  gettext_noop("Origin"),
+							  gettext_noop("Min apply delay"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5e1882eaea..e8b9a43a47 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1925,7 +1925,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("(", "PUBLICATION");
 	/* ALTER SUBSCRIPTION <name> SET ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SET", "("))
-		COMPLETE_WITH("binary", "disable_on_error", "origin", "slot_name",
+		COMPLETE_WITH("binary", "disable_on_error", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit");
 	/* ALTER SUBSCRIPTION <name> SKIP ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SKIP", "("))
@@ -3268,7 +3268,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
-					  "disable_on_error", "enabled", "origin", "slot_name",
+					  "disable_on_error", "enabled", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index b0f2a1705d..d1cfefc6d6 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -74,6 +74,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	Oid			subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */
 
+	int32		subminapplydelay;	/* Replication apply delay (ms) */
+
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
@@ -122,6 +124,7 @@ typedef struct Subscription
 								 * skipped */
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
+	int32		minapplydelay;	/* Replication apply delay (ms) */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index dc87a4edd1..3dc09d1a4c 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -255,7 +255,7 @@ extern void stream_stop_internal(TransactionId xid);
 
 /* Common streaming function to apply all the spooled messages */
 extern void apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-								   XLogRecPtr lsn);
+								   XLogRecPtr lsn, TimestampTz finish_ts);
 
 extern void apply_dispatch(StringInfo s);
 
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 3f99b14394..cf8e727ee9 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -114,18 +114,18 @@ CREATE SUBSCRIPTION regress_testsub4 CONNECTION 'dbname=regress_doesnotexist' PU
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub4 SET (origin = any);
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub3;
@@ -143,10 +143,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -163,10 +163,10 @@ ERROR:  unrecognized subscription parameter: "create_slot"
 -- ok
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/12345');
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/12345
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/12345
 (1 row)
 
 -- ok - with lsn = NONE
@@ -175,10 +175,10 @@ ALTER SUBSCRIPTION regress_testsub SKIP (lsn = NONE);
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/0');
 ERROR:  invalid WAL location (LSN): 0/0
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/0
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 BEGIN;
@@ -210,10 +210,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                                               List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
----------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | local              | dbname=regress_doesnotexist2 | 0/0
+                                                                                                        List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | local              | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 -- rename back to keep the rest simple
@@ -247,19 +247,19 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -271,27 +271,27 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication already exists
@@ -306,10 +306,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                                                 List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication used more than once
@@ -324,10 +324,10 @@ ERROR:  publication "testpub3" is not in subscription "regress_testsub"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -363,10 +363,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 --fail - alter of two_phase option not supported.
@@ -375,10 +375,10 @@ ERROR:  unrecognized subscription parameter: "two_phase"
 -- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -388,10 +388,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -404,20 +404,57 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+ERROR:  invalid value for parameter "min_apply_delay": "foo"
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+ERROR:  -1 ms is outside the valid range for parameter "min_apply_delay" (0 .. 2147483647)
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+ERROR:  min_apply_delay > 0 and streaming = parallel are mutually exclusive options
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+WARNING:  subscription was created, but is not connected
+HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |             123 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |        86400000 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+ERROR:  cannot set parallel streaming mode for subscription with min_apply_delay
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ERROR:  cannot set min_apply_delay for subscription in parallel streaming mode
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 7281f5fee2..7317b140f5 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -286,6 +286,30 @@ ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+\dRs+
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 91aa068c95..f94819672b 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -515,6 +515,36 @@ $node_publisher->poll_query_until('postgres',
   or die
   "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
 
+# Test time-delayed logical replication
+#
+# If the subscription sets the min_apply_delay parameter, the logical replication
+# worker delays applying each transaction by min_apply_delay milliseconds. We
+# measure the time between inserting tuples on the publisher and the changes
+# being replicated on the subscriber.
+my $delay = 3;
+
+# Set min_apply_delay parameter to 3 seconds
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+
+# Make new content on publisher and check its presence in subscriber depending
+# on the delay applied above. Before doing the insertion, get the
+# current timestamp that will be used as a comparison base. Even on slow
+# machines, this gives predictable behavior when comparing the time of the
+# insertion on the publisher with the replay time on the subscriber.
+my $publisher_insert_time = time();
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_ins VALUES (generate_series(1101, 1120))");
+
+# The publisher waits for the replication to complete
+$node_publisher->wait_for_catchup('tap_sub_renamed');
+
+# This test is successful if and only if the LSN has been applied with at least
+# the configured apply delay.
+ok( time() - $publisher_insert_time >= $delay,
+	"subscriber applies WAL only after replication delay for non-streaming transaction"
+);
+
 # check all the cleanup
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
 
-- 
2.27.0
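
For readers skimming the regression output above, the rules it exercises for min_apply_delay can be summarized in a small standalone sketch. This is not code from the patch: the names sketch_parse_delay_ms, SketchDelayStatus and sketch_delay_allows_streaming_parallel are hypothetical, and the set of accepted unit suffixes is an assumption. It only mirrors what the expected output shows: a bare number is taken as milliseconds, '1 d' is stored as 86400000, the valid range is 0 .. 2147483647, and a delay greater than zero cannot be combined with streaming = parallel.

```c
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result codes; the patch reports these cases via ereport(). */
typedef enum SketchDelayStatus
{
	SKETCH_DELAY_OK,
	SKETCH_DELAY_INVALID,		/* e.g. min_apply_delay = foo */
	SKETCH_DELAY_OUT_OF_RANGE	/* e.g. min_apply_delay = -1 */
} SketchDelayStatus;

/*
 * Convert a min_apply_delay string into milliseconds.
 * Assumed unit suffixes: ms, s, min, h, d; no suffix means milliseconds.
 */
SketchDelayStatus
sketch_parse_delay_ms(const char *value, long long *result_ms)
{
	char	   *end;
	long long	number;
	long long	multiplier;

	number = strtoll(value, &end, 10);
	if (end == value)
		return SKETCH_DELAY_INVALID;	/* no leading number at all */

	/* Allow whitespace between the number and the unit, as in '1 d'. */
	while (isspace((unsigned char) *end))
		end++;

	if (*end == '\0' || strcmp(end, "ms") == 0)
		multiplier = 1;
	else if (strcmp(end, "s") == 0)
		multiplier = 1000;
	else if (strcmp(end, "min") == 0)
		multiplier = 60 * 1000;
	else if (strcmp(end, "h") == 0)
		multiplier = 60 * 60 * 1000;
	else if (strcmp(end, "d") == 0)
		multiplier = 24 * 60 * 60 * 1000;
	else
		return SKETCH_DELAY_INVALID;	/* unknown unit */

	/* Range check done by division so the multiplication cannot overflow. */
	if (number < 0 || number > INT_MAX / multiplier)
		return SKETCH_DELAY_OUT_OF_RANGE;

	*result_ms = number * multiplier;
	return SKETCH_DELAY_OK;
}

/* A delay greater than zero and streaming = parallel are mutually exclusive. */
bool
sketch_delay_allows_streaming_parallel(long long delay_ms)
{
	return delay_ms == 0;
}
```

With this sketch, "123" yields 123, "1 d" yields 86400000, "foo" is rejected as invalid, and "-1" is rejected as out of range, which lines up with the four CREATE/ALTER SUBSCRIPTION cases in the test.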

v3-0002-Extend-START_REPLICATION-command-to-accept-walsen.patch (application/octet-stream)
From 3ebe9bc1a11ecb9f4ae3d1b214c7eaea2a0c724d Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Tue, 7 Feb 2023 05:38:20 +0000
Subject: [PATCH v3 2/2] Extend START_REPLICATION command to accept walsender
 options

This commit extends START_REPLICATION to accept walsender options. Currently,
only one option, exit_before_confirming, is accepted.

For physical replication, the grammar of START_REPLICATION is extended to accept
options. Note that in normal physical replication the added option is never
used.

For logical replication, the option list for the logical decoding plugin is reused
for storing walsender options. When the min_apply_delay parameter is set for a
subscription, the associated apply worker sends a START_REPLICATION query
with exit_before_confirming = true to the publisher node.

This option allows primary servers to shut down even if there is pending WAL to
be sent or the sent WAL has not been flushed on the secondary. This may be useful
for shutting down the primary even when the walreceiver/worker is stuck.

Author: Hayato Kuroda
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 doc/src/sgml/protocol.sgml                    | 21 ++++-
 .../libpqwalreceiver/libpqwalreceiver.c       |  4 +
 src/backend/replication/logical/worker.c      | 14 ++-
 src/backend/replication/repl_gram.y           |  8 +-
 src/backend/replication/walsender.c           | 87 ++++++++++++++++++-
 src/include/replication/walreceiver.h         |  1 +
 src/test/subscription/t/001_rep_changes.pl    | 10 ++-
 src/tools/pgindent/typedefs.list              |  1 +
 8 files changed, 139 insertions(+), 7 deletions(-)
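
As a reading aid for the commit message above, the intended shutdown rule can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in walsender.c: the struct SketchWalSenderState, its fields, and sketch_walsender_can_exit are hypothetical names introduced only for this example. It encodes two statements from this thread: without the option, the walsender delays shutdown until all sent data is confirmed flushed on the remote side; with exit_before_confirming, it may exit even if WAL is still pending or unconfirmed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of the state consulted at shutdown. */
typedef struct SketchWalSenderState
{
	bool		exit_before_confirming; /* option passed via START_REPLICATION */
	bool		output_pending;			/* WAL still waiting to be sent */
	uint64_t	sent_lsn;				/* last WAL position sent */
	uint64_t	flushed_lsn;			/* last position reported flushed by remote */
} SketchWalSenderState;

/* Decide whether a walsender that has been asked to stop may exit now. */
bool
sketch_walsender_can_exit(const SketchWalSenderState *state)
{
	/* With the new option, do not wait for the remote flush confirmation. */
	if (state->exit_before_confirming)
		return true;

	/* Otherwise wait until everything was sent and confirmed flushed. */
	return !state->output_pending &&
		state->sent_lsn == state->flushed_lsn;
}
```

The design point is that the option only relaxes the wait at shutdown; a walsender that is not stopping behaves the same with or without it.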

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 93fc7167d4..9c84d57cfb 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2192,7 +2192,7 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
     </varlistentry>
 
     <varlistentry id="protocol-replication-start-replication">
-     <term><literal>START_REPLICATION</literal> [ <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> ] [ <literal>PHYSICAL</literal> ] <replaceable class="parameter">XXX/XXX</replaceable> [ <literal>TIMELINE</literal> <replaceable class="parameter">tli</replaceable> ]
+     <term><literal>START_REPLICATION</literal> [ <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> ] [ <literal>PHYSICAL</literal> ] <replaceable class="parameter">XXX/XXX</replaceable> [ <literal>TIMELINE</literal> <replaceable class="parameter">tli</replaceable> ] [ ( <replaceable>option_name</replaceable> [ <replaceable>option_value</replaceable> ] [, ...] ) ]
       <indexterm><primary>START_REPLICATION</primary></indexterm>
      </term>
      <listitem>
@@ -2496,6 +2496,25 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
         </listitem>
        </varlistentry>
       </variablelist>
+
+      <para>
+       Additional options control the behavior of the walsender in more
+       detail. Currently, the following option is accepted:
+      </para>
+
+      <variablelist>
+       <varlistentry>
+        <term>exit_before_confirming</term>
+        <listitem>
+         <para>
+          If set to true, the walsender will exit before confirming the remote
+          flush of WALs at shutdown. This can be useful when the network lag
+          between nodes are large and it takes time to shut down the server.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+
      </listitem>
     </varlistentry>
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 560ec974fa..8bf8e03063 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -443,6 +443,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", binary 'true'");
 
+		if (options->proto.logical.exit_before_confirming &&
+			PQserverVersion(conn->streamConn) >= 160000)
+			appendStringInfoString(&cmd, ", exit_before_confirming 'true'");
+
 		appendStringInfoChar(&cmd, ')');
 	}
 	else
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c574531040..6e98223d52 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4034,7 +4034,9 @@ maybe_reread_subscription(void)
 		newsub->stream != MySubscription->stream ||
 		strcmp(newsub->origin, MySubscription->origin) != 0 ||
 		newsub->owner != MySubscription->owner ||
-		!equal(newsub->publications, MySubscription->publications))
+		!equal(newsub->publications, MySubscription->publications) ||
+		(newsub->minapplydelay > 0 && MySubscription->minapplydelay == 0) ||
+		(newsub->minapplydelay == 0 && MySubscription->minapplydelay > 0))
 	{
 		if (am_parallel_apply_worker())
 			ereport(LOG,
@@ -4753,9 +4755,19 @@ ApplyWorkerMain(Datum main_arg)
 
 	options.proto.logical.twophase = false;
 	options.proto.logical.origin = pstrdup(MySubscription->origin);
+	options.proto.logical.exit_before_confirming = false;
 
 	if (!am_tablesync_worker())
 	{
+		/*
+		 * time-delayed logical replication does not support tablesync
+		 * workers, so only the leader apply worker can request walsenders to
+		 * exit before confirming remote flush.
+		 */
+		if (server_version >= 160000)
+			options.proto.logical.exit_before_confirming =
+				MySubscription->minapplydelay > 0;
+
 		/*
 		 * Even when the two_phase mode is requested by the user, it remains
 		 * as the tri-state PENDING until all tablesyncs have reached READY
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..1705d52a58 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -91,6 +91,7 @@ Node *replication_parse_result;
 %type <boolval>	opt_temporary
 %type <list>	create_slot_options create_slot_legacy_opt_list
 %type <defelt>	create_slot_legacy_opt
+%type <list>	walsender_options
 
 %%
 
@@ -261,7 +262,7 @@ drop_replication_slot:
  * START_REPLICATION [SLOT slot] [PHYSICAL] %X/%X [TIMELINE %d]
  */
 start_replication:
-			K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline
+			K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline walsender_options
 				{
 					StartReplicationCmd *cmd;
 
@@ -270,6 +271,7 @@ start_replication:
 					cmd->slotname = $2;
 					cmd->startpoint = $4;
 					cmd->timeline = $5;
+					cmd->options = $6;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -336,6 +338,10 @@ opt_timeline:
 				| /* EMPTY */			{ $$ = 0; }
 			;
 
+walsender_options:
+			'(' generic_option_list ')'			{ $$ = $2; }
+			| /* EMPTY */					{ $$ = NIL; }
+		;
 
 plugin_options:
 			'(' plugin_opt_list ')'			{ $$ = $2; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4ed3747e3f..bcc7c38080 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -219,6 +219,22 @@ typedef struct
 
 static LagTracker *lag_tracker;
 
+/*
+ * If set to true, the walsender will exit before confirming flush of remote
+ * WALs and whether the send buffer is empty.
+ */
+static bool exit_before_confirming = false;
+
+/*
+ * Options for controlling the behavior of the walsender. Options can be
+ * specified in the START_REPLICATION replication command. Currently only one
+ * option is allowed.
+ */
+typedef struct
+{
+	bool		exit_before_confirming;
+} WalSndData;
+
 /* Signal handlers */
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
@@ -260,6 +276,7 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
 							  TimeLineID *tli_p);
 
+static void ConsumeWalsenderOptions(List *options, WalSndData *data);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -672,6 +689,7 @@ StartReplication(StartReplicationCmd *cmd)
 	StringInfoData buf;
 	XLogRecPtr	FlushPtr;
 	TimeLineID	FlushTLI;
+	WalSndData	data = {0};
 
 	/* create xlogreader for physical replication */
 	xlogreader =
@@ -710,6 +728,12 @@ StartReplication(StartReplicationCmd *cmd)
 		 */
 	}
 
+	/* Check given options and set flags accordingly */
+	ConsumeWalsenderOptions(cmd->options, &data);
+
+	if (data.exit_before_confirming)
+		exit_before_confirming = true;
+
 	/*
 	 * Select the timeline. If it was given explicitly by the client, use
 	 * that. Otherwise use the timeline of the last replayed record.
@@ -1245,6 +1269,7 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 {
 	StringInfoData buf;
 	QueryCompletion qc;
+	WalSndData	data = {0};
 
 	/* make sure that our requirements are still fulfilled */
 	CheckLogicalDecodingRequirements();
@@ -1272,6 +1297,12 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 		got_STOPPING = true;
 	}
 
+	/* Check given options and set flags accordingly */
+	ConsumeWalsenderOptions(cmd->options, &data);
+
+	if (data.exit_before_confirming)
+		exit_before_confirming = true;
+
 	/*
 	 * Create our decoding context, making it start at the previously ack'ed
 	 * position.
@@ -1450,6 +1481,9 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		if (exit_before_confirming)
+			WalSndDone(XLogSendLogical);
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
@@ -3118,15 +3152,16 @@ WalSndDone(WalSndSendDataCallback send_data)
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	if (WalSndCaughtUp &&
+		(exit_before_confirming ||
+		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
 	{
 		QueryCompletion qc;
 
 		/* Inform the standby that XLOG streaming is done */
 		SetQueryCompletion(&qc, CMDTAG_COPY, 0);
 		EndCommand(&qc, DestRemote, false);
-		pq_flush();
+		pq_flush_if_writable();
 
 		proc_exit(0);
 	}
@@ -3849,3 +3884,49 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+/*
+ * Reads all entrly of the list and consume if needed.
+ *
+ * In logical replication mode, the given list may contain both walsender and
+ * output_plugin options, and it leads "unrecognized pgoutput option" ERROR.
+ * Therefore, the entry for walsender options will be eliminated from the list
+ * if we found.
+ */
+static void
+ConsumeWalsenderOptions(List *options, WalSndData *data)
+{
+	ListCell   *lc;
+	bool		exit_before_confirming_given = false;
+
+	foreach(lc, options)
+	{
+		DefElem    *defel = (DefElem *) lfirst(lc);
+
+		Assert(defel->arg == NULL || IsA(defel->arg, String));
+
+		/* Check each param, whether or not we recognize it */
+		if (strcmp(defel->defname, "exit_before_confirming") == 0)
+		{
+			if (exit_before_confirming_given)
+				ereport(ERROR,
+						errcode(ERRCODE_SYNTAX_ERROR),
+						errmsg("conflicting or redundant options"));
+			exit_before_confirming_given = true;
+
+			data->exit_before_confirming = defGetBoolean(defel);
+
+			/*
+			 * Eliminates current element, because the list may be bypassed to
+			 * the pgoutput module and it will raise an ERROR due to the
+			 * unrecognized option.
+			 */
+			options = foreach_delete_current(options, lc);
+		}
+
+		/*
+		 * ERROR is not raised here even if the given parameter is not known,
+		 * because it may be written for the output plugin.
+		 */
+	}
+}
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index decffe352d..f801fb3e0d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -187,6 +187,7 @@ typedef struct
 									 * prepare time */
 			char	   *origin; /* Only publish data originating from the
 								 * specified origin */
+			bool		exit_before_confirming;
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index f94819672b..d7a6fd0e38 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -523,9 +523,17 @@ $node_publisher->poll_query_until('postgres',
 # changes are replicated on subscriber.
 my $delay = 3;
 
-# Set min_apply_delay parameter to 3 seconds
+# check restart on changing min_apply_delay to 3 seconds
+$oldpid = $node_publisher->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+);
 $node_subscriber->safe_psql('postgres',
 	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+$node_publisher->poll_query_until('postgres',
+	"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+  )
+  or die
+  "Timed out while waiting for apply to restart after changing min_apply_delay to non-zero value";
 
 # Make new content on publisher and check its presence in subscriber depending
 # on the delay applied above. Before doing the insertion, get the
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 07fbb7ccf6..3b7f8eb063 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2968,6 +2968,7 @@ WalReceiverConn
 WalReceiverFunctionsType
 WalSnd
 WalSndCtlData
+WalSndData
 WalSndSendDataCallback
 WalSndState
 WalTimeSample
-- 
2.27.0

#43Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#42)
Re: Exit walsender before confirming remote flush in logical replication

I agree with the direction, and thanks for the patch.

At Tue, 7 Feb 2023 17:08:54 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in

I noticed that previous ones are rejected by cfbot, even if they passed on my
environment...
PSA fixed version.

While analyzing more, I found the further bug that forgets initialization.
PSA new version that could be passed automated tests on my github repository.
Sorry for noise.

0002:

This patch doesn't seem to offer a means to change the default
walsender behavior. We need a subscription option named like
"walsender_exit_mode" to do that.

+ConsumeWalsenderOptions(List *options, WalSndData *data)

I wonder if it is the right design to put options for different things
into a single list. I would rather embed the walsender option in
the syntax than need this function.

K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline opt_shutdown_mode

K_START_REPLICATION K_SLOTIDENT K_LOGICAL RECPTR opt_shutdown_mode plugin_options

where opt_shutdown_mode would be like "SHUTDOWN_MODE immediate".

======
If we go with the current design, I think it is better to share the
option list rule between the logical and physical START_REPLICATION
commands.

I'm not sure I like the option syntax
"exit_before_confirming=<Boolean>". I imagine that other options may
come in the future. Thus, how about "walsender_shutdown_mode=<mode>",
where the mode is one of "wait_flush" (default) and "immediate"?
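
As a rough illustration of the enumeration idea, the mode string could be mapped to an enum along the following lines. This is only a sketch: the type WalSndShutdownMode, its members, and the function parse_walsender_shutdown_mode() are hypothetical names made up for this example and are not taken from any posted patch. Only the two mode strings, "wait_flush" and "immediate", come from the suggestion above.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical enumeration; the names are illustrative only. */
typedef enum
{
	WALSND_SHUTDOWN_MODE_WAIT_FLUSH,	/* default: wait for remote flush */
	WALSND_SHUTDOWN_MODE_IMMEDIATE		/* exit without waiting */
} WalSndShutdownMode;

/*
 * Map a mode string to the enumeration.
 *
 * A NULL value selects the default. Returns false for an unrecognized
 * string, in which case *mode is left untouched.
 */
static bool
parse_walsender_shutdown_mode(const char *value, WalSndShutdownMode *mode)
{
	if (value == NULL || strcmp(value, "wait_flush") == 0)
	{
		*mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
		return true;
	}
	if (strcmp(value, "immediate") == 0)
	{
		*mode = WALSND_SHUTDOWN_MODE_IMMEDIATE;
		return true;
	}
	return false;
}
```

Compared with a boolean, an unknown value is reported explicitly instead of being coerced, and a further mode can be added later without changing the option name.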

+typedef struct
+{
+	bool		exit_before_confirming;
+} WalSndData;

Data doesn't seem to represent the variable. Why not WalSndOptions?

-		!equal(newsub->publications, MySubscription->publications))
+		!equal(newsub->publications, MySubscription->publications) ||
+		(newsub->minapplydelay > 0 && MySubscription->minapplydelay == 0) ||
+		(newsub->minapplydelay == 0 && MySubscription->minapplydelay > 0))

I slightly prefer the following expression (Others may disagree:p):

((newsub->minapplydelay == 0) != (MySubscription->minapplydelay == 0))

And I think we need a comment for the term. For example,

/* minapplydelay affects START_REPLICATION option exit_before_confirming */
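
For reference, the two forms of the condition can be compared side by side in a small standalone sketch. The helper names delay_toggled_verbose() and delay_toggled_compact() are hypothetical and exist only for this illustration. The sketch assumes that the delay is never negative; for negative values the two expressions would differ.

```c
#include <stdbool.h>

/* The two-clause condition as written in the patch. */
static bool
delay_toggled_verbose(int newdelay, int olddelay)
{
	return (newdelay > 0 && olddelay == 0) ||
		(newdelay == 0 && olddelay > 0);
}

/*
 * The compact form: true exactly when one delay is zero and the other is
 * not. Equivalent to the verbose form for non-negative inputs.
 */
static bool
delay_toggled_compact(int newdelay, int olddelay)
{
	return (newdelay == 0) != (olddelay == 0);
}
```

Both helpers return true only when the delay switches between zero and non-zero, which is the case where the START_REPLICATION option would need to change.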

+ * Reads all entrly of the list and consume if needed.
s/entrly/entries/ ?
...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#44Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#43)
Re: Exit walsender before confirming remote flush in logical replication

On Wed, Feb 8, 2023 at 7:57 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

I agree to the direction and thanks for the patch.

At Tue, 7 Feb 2023 17:08:54 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in

I noticed that previous ones are rejected by cfbot, even if they passed on my
environment...
PSA fixed version.

While analyzing more, I found the further bug that forgets initialization.
PSA new version that could be passed automated tests on my github repository.
Sorry for noise.

0002:

This patch doesn't seem to offer a means to change the default
walsender behavior. We need a subscription option named like
"walsender_exit_mode" to do that.

I don't think at this stage we need a subscription-level option, we
can extend it later if this is really useful for users. For now, we
can set this new option when min_apply_delay > 0.

+ConsumeWalsenderOptions(List *options, WalSndData *data)

I wonder if it is the right design to put options for different things
into a single list. I rather choose to embed the walsender option in
the syntax than needing this function.

K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline opt_shutdown_mode

K_START_REPLICATION K_SLOTIDENT K_LOGICAL RECPTR opt_shutdown_mode plugin_options

where opt_shutdown_mode would be like "SHUTDOWN_MODE immediate".

The other option could have been that we just add it as a
plugin_option for logical replication but it doesn't seem to match
with the other plugin options. I think it would be better to have it
as a separate option something like opt_shutdown_immediate and extend
the logical replication syntax for now. We can later extend physical
replication syntax when we want to expose such an option via physical
replication.

--
With Regards,
Amit Kapila.

#45Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#44)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit,

Thanks for giving comments!

0002:

This patch doesn't seem to offer a means to change the default
walsender behavior. We need a subscription option named like
"walsender_exit_mode" to do that.

I don't think at this stage we need a subscription-level option, we
can extend it later if this is really useful for users. For now, we
can set this new option when min_apply_delay > 0.

Agreed. I want to keep the feature internal for PG16 and extend it later if needed.

+ConsumeWalsenderOptions(List *options, WalSndData *data)

I wonder if it is the right design to put options for different things
into a single list. I rather choose to embed the walsender option in
the syntax than needing this function.

K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline opt_shutdown_mode

K_START_REPLICATION K_SLOTIDENT K_LOGICAL RECPTR opt_shutdown_mode plugin_options

where opt_shutdown_mode would be like "SHUTDOWN_MODE immediate".

The other option could have been that we just add it as a
plugin_option for logical replication but it doesn't seem to match
with the other plugin options. I think it would be better to have it
as a separate option something like opt_shutdown_immediate and extend
the logical replication syntax for now. We can later extend physical
replication syntax when we want to expose such an option via physical
replication.

Our main intention is to shut down logical walsenders. Therefore, as above,
I want to develop the feature for logical replication first and extend it later if we want.
TBH, I think adding physical replication support would not be so hard,
but I want to keep the patch small.

The new patch will be attached soon in another mail.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#46Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Kyotaro Horiguchi (#43)
2 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san,

Thank you for checking the patch! PSA new version.

0002:

This patch doesn't seem to offer a means to change the default
walsender behavior. We need a subscription option named like
"walsender_exit_mode" to do that.

As I said in another mail [1], I don't think the feature needs to be
usable on its own for now.

+ConsumeWalsenderOptions(List *options, WalSndData *data)

I wonder if it is the right design to put options for different things
into a single list. I rather choose to embed the walsender option in
the syntax than needing this function.

K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline opt_shutdown_mode

K_START_REPLICATION K_SLOTIDENT K_LOGICAL RECPTR opt_shutdown_mode plugin_options

where opt_shutdown_mode would be like "SHUTDOWN_MODE immediate".

Right, the option handling was quite bad. I added new syntax, opt_shutdown_mode,
to the logical replication command and modified much of the code accordingly.
Note that, based on the other discussion, I removed the code
for supporting physical replication but tried to keep it extensible.

======
If we go with the current design, I think it is better to share the
option list rule between the logical and physical START_REPLICATION
commands.

I'm not sure I like the option syntax
"exit_before_confirming=<Boolean>". I imagine that other options may
come in the future. Thus, how about "walsender_shutdown_mode=<mode>",
where the mode is one of "wait_flush" (default) and "immediate"?

Seems better; I changed it from a boolean to an enumeration.

+typedef struct
+{
+	bool		exit_before_confirming;
+} WalSndData;

Data doesn't seem to represent the variable. Why not WalSndOptions?

This is inspired by PGOutputData, but I prefer your idea. Fixed.

-		!equal(newsub->publications, MySubscription->publications))
+		!equal(newsub->publications, MySubscription->publications) ||
+		(newsub->minapplydelay > 0 && MySubscription->minapplydelay == 0) ||
+		(newsub->minapplydelay == 0 && MySubscription->minapplydelay > 0))

I slightly prefer the following expression (Others may disagree:p):

((newsub->minapplydelay == 0) != (MySubscription->minapplydelay == 0))

I think conditions on the same parameter should fit on one line,
so your version seems better. Fixed.

And I think we need a comment for the term. For example,

/* minapplydelay affects START_REPLICATION option exit_before_confirming */

Added just above the condition.

+ * Reads all entrly of the list and consume if needed.
s/entrly/entries/ ?
...

This part is no longer needed.

[1]: /messages/by-id/TYAPR01MB5866D3EC780D251953BDE7FAF5D89@TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v4-0001-Time-delayed-logical-replication-subscriber.patch (application/octet-stream)
From 1580af48dba4e830ac0cbfc012c40f295a2c33b7 Mon Sep 17 00:00:00 2001
From: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Date: Tue, 7 Feb 2023 13:05:34 +0000
Subject: [PATCH v4 1/2] Time-delayed logical replication subscriber

Similar to physical replication, a time-delayed copy of the data for
logical replication is useful for some scenarios (particularly to fix
errors that might cause data loss).

This patch implements a new subscription parameter called 'min_apply_delay'.

If the subscription sets min_apply_delay parameter, the logical
replication worker will delay the transaction apply for min_apply_delay
milliseconds.

The delay is calculated between the WAL time stamp and the current time
on the subscriber.

The delay occurs before we start to apply the transaction on the
subscriber. The main reason is to avoid keeping a transaction open for
a long time. Regular and prepared transactions are covered. Streamed
transactions are also covered.

The combination of parallel streaming mode and min_apply_delay is not
allowed. This is because in parallel streaming mode, we start applying
the transaction stream as soon as the first change arrives without
knowing the transaction's prepare/commit time. This means we cannot
calculate the underlying network/decoding lag between publisher and
subscriber, and so always waiting for the full 'min_apply_delay' period
might include unnecessary delay.

The other possibility was to apply the delay at the end of the parallel
apply transaction but that would cause issues related to resource
bloat and locks being held for a long time.

Note that this feature doesn't interact with skip transaction feature.
The skip transaction feature applies to one transaction with a specific LSN.
So, even if the skipped transaction and non-skipped transaction come
consecutively in a very short time, regardless of the order of which comes
first, the time-delayed feature gets balanced by delayed application
for other transactions before and after the skipped transaction.

Author: Euler Taveira, Takamichi Osumi, Kuroda Hayato
Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Shveta Malik,
             Kyotaro Horiguchi, Shi Yu, Wang Wei, Dilip Kumar, Melih Mutlu
Discussion: https://postgr.es/m/CAB-JLwYOYwL=XTyAXKiH5CtM_Vm8KjKh7aaitCKvmCh4rzr5pQ@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                    |   9 +
 doc/src/sgml/config.sgml                      |  12 ++
 doc/src/sgml/glossary.sgml                    |  14 ++
 doc/src/sgml/logical-replication.sgml         |   6 +
 doc/src/sgml/ref/alter_subscription.sgml      |   5 +-
 doc/src/sgml/ref/create_subscription.sgml     |  49 ++++-
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/catalog/system_views.sql          |   7 +-
 src/backend/commands/subscriptioncmds.c       | 122 +++++++++++-
 .../replication/logical/applyparallelworker.c |   3 +-
 src/backend/replication/logical/worker.c      | 165 ++++++++++++++--
 src/bin/pg_dump/pg_dump.c                     |  15 +-
 src/bin/pg_dump/pg_dump.h                     |   1 +
 src/bin/psql/describe.c                       |   9 +-
 src/bin/psql/tab-complete.c                   |   4 +-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/replication/worker_internal.h     |   2 +-
 src/test/regress/expected/subscription.out    | 181 +++++++++++-------
 src/test/regress/sql/subscription.sql         |  24 +++
 src/test/subscription/t/001_rep_changes.pl    |  30 +++
 20 files changed, 558 insertions(+), 104 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048054..5dc5ca1133 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7873,6 +7873,15 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subminapplydelay</structfield> <type>int4</type>
+      </para>
+      <para>
+       The minimum delay, in milliseconds, for applying changes
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subname</structfield> <type>name</type>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d190be1925..626a8b5bd0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4787,6 +4787,18 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
        the <filename>postgresql.conf</filename> file or on the server
        command line.
       </para>
+      <para>
+       For time-delayed logical replication, the apply worker sends a feedback
+       message to the publisher every
+       <varname>wal_receiver_status_interval</varname> milliseconds. Make sure
+       to set <varname>wal_receiver_status_interval</varname> less than the
+       <varname>wal_sender_timeout</varname> on the publisher, otherwise, the
+       <literal>walsender</literal> will repeatedly terminate due to timeout
+       errors. Note that if <varname>wal_receiver_status_interval</varname> is
+       set to zero, the apply worker sends no feedback messages during the
+       <literal>min_apply_delay</literal> period. Refer to
+       <xref linkend="sql-createsubscription"/> for more information.
+      </para>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 7c01a541fe..6ed6fa5853 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1729,6 +1729,20 @@
    </glossdef>
   </glossentry>
 
+  <glossentry id="glossary-time-delayed-replication">
+   <glossterm>Time-delayed replication</glossterm>
+   <glossdef>
+     <para>
+      Replication setup that applies time-delayed copy of the data.
+    </para>
+    <para>
+     For more information, see
+     <xref linkend="guc-recovery-min-apply-delay"/> for physical replication
+     and <xref linkend="sql-createsubscription"/> for logical replication.
+    </para>
+   </glossdef>
+  </glossentry>
+
   <glossentry id="glossary-toast">
    <glossterm>TOAST</glossterm>
    <glossdef>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 1bd5660c87..6bd5f61e2b 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -247,6 +247,12 @@
    target table.
   </para>
 
+  <para>
+   A subscription can delay the application of changes by specifying the
+   <literal>min_apply_delay</literal> subscription parameter. See
+   <xref linkend="sql-createsubscription"/> for details.
+  </para>
+
   <sect2 id="logical-replication-subscription-slot">
    <title>Replication Slot Management</title>
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 964fcbb8ff..8b7eb28e54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -213,8 +213,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
       <literal>binary</literal>, <literal>streaming</literal>,
-      <literal>disable_on_error</literal>, and
-      <literal>origin</literal>.
+      <literal>disable_on_error</literal>,
+      <literal>origin</literal>, and
+      <literal>min_apply_delay</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 51c45f17c7..1b4b8390af 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -349,7 +349,49 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
-      </variablelist></para>
+
+       <varlistentry>
+        <term><literal>min_apply_delay</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          By default, the subscriber applies changes as soon as possible. This
+          parameter allows the user to delay the application of changes by a
+          given time period. If the value is specified without units, it is
+          taken as milliseconds. The default is zero (no delay). See
+          <xref linkend="config-setting-names-values"/> for details on the
+          available valid time units.
+         </para>
+         <para>
+          Any delay becomes effective only after all initial table
+          synchronization has finished and occurs before each transaction starts
+          to get applied on the subscriber. The delay is calculated as the
+          difference between the WAL timestamp as written on the publisher and
+          the current time on the subscriber. Any overhead of time spent in
+          logical decoding and in transferring the transaction may reduce the
+          actual wait time. It is also possible that the overhead already
+          exceeds the requested <literal>min_apply_delay</literal> value, in
+          which case no delay is applied. If the system clocks on publisher and
+          subscriber are not synchronized, this may lead to apply changes
+          earlier than expected, but this is not a major issue because this
+          parameter is typically much larger than the time deviations between
+          servers. Note that if this parameter is set to a long delay, the
+          replication will stop if the replication slot falls behind the current
+          LSN by more than
+          <link linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</literal></link>.
+         </para>
+         <warning>
+           <para>
+            Delaying the replication means there is a much longer time between
+            making a change on the publisher, and that change being committed
+            on the subscriber. This can impact the performance of synchronous
+            replication. See <xref linkend="guc-synchronous-commit"/>
+            parameter.
+           </para>
+         </warning>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
 
     </listitem>
    </varlistentry>
@@ -420,6 +462,11 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
    published with different column lists are not supported.
   </para>
 
+  <para>
+   A non-zero <literal>min_apply_delay</literal> parameter is not allowed when
+   streaming in parallel mode.
+  </para>
+
   <para>
    We allow non-existent publications to be specified so that users can add
    those later. This means
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index a56ae311c3..e19e5cbca2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->skiplsn = subform->subskiplsn;
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
+	sub->minapplydelay = subform->subminapplydelay;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..317c2010cb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1299,9 +1299,10 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled,
-              subbinary, substream, subtwophasestate, subdisableonerr,
-              subslotname, subsynccommit, subpublications, suborigin)
+GRANT SELECT (oid, subdbid, subskiplsn, subminapplydelay, subname, subowner,
+              subenabled, subbinary, substream, subtwophasestate,
+              subdisableonerr, subslotname, subsynccommit, subpublications,
+              suborigin)
     ON pg_subscription TO public;
 
 CREATE VIEW pg_stat_subscription_stats AS
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 464db6d247..82e16fd0f9 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -66,6 +66,7 @@
 #define SUBOPT_DISABLE_ON_ERR		0x00000400
 #define SUBOPT_LSN					0x00000800
 #define SUBOPT_ORIGIN				0x00001000
+#define SUBOPT_MIN_APPLY_DELAY		0x00002000
 
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
@@ -90,6 +91,7 @@ typedef struct SubOpts
 	bool		disableonerr;
 	char	   *origin;
 	XLogRecPtr	lsn;
+	int32		min_apply_delay;
 } SubOpts;
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
@@ -100,7 +102,7 @@ static void check_publications_origin(WalReceiverConn *wrconn,
 static void check_duplicates_in_publist(List *publist, Datum *datums);
 static List *merge_publications(List *oldpublist, List *newpublist, bool addpub, const char *subname);
 static void ReportSlotConnectionError(List *rstates, Oid subid, char *slotname, char *err);
-
+static int32 defGetMinApplyDelay(DefElem *def);
 
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
@@ -146,6 +148,8 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 		opts->disableonerr = false;
 	if (IsSet(supported_opts, SUBOPT_ORIGIN))
 		opts->origin = pstrdup(LOGICALREP_ORIGIN_ANY);
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY))
+		opts->min_apply_delay = 0;
 
 	/* Parse options */
 	foreach(lc, stmt_options)
@@ -324,6 +328,15 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 			opts->specified_opts |= SUBOPT_LSN;
 			opts->lsn = lsn;
 		}
+		else if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+				 strcmp(defel->defname, "min_apply_delay") == 0)
+		{
+			if (IsSet(opts->specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				errorConflictingDefElem(defel, pstate);
+
+			opts->specified_opts |= SUBOPT_MIN_APPLY_DELAY;
+			opts->min_apply_delay = defGetMinApplyDelay(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -404,6 +417,32 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 								"slot_name = NONE", "create_slot = false")));
 		}
 	}
+
+	/*
+	 * The combination of parallel streaming mode and min_apply_delay is not
+	 * allowed. This is because in parallel streaming mode, we start applying
+	 * the transaction stream as soon as the first change arrives without
+	 * knowing the transaction's prepare/commit time. This means we cannot
+	 * calculate the underlying network/decoding lag between publisher and
+	 * subscriber, and so always waiting for the full 'min_apply_delay' period
+	 * might include unnecessary delay.
+	 *
+	 * The other possibility was to apply the delay at the end of the parallel
+	 * apply transaction but that would cause issues related to resource bloat
+	 * and locks being held for a long time.
+	 */
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+		opts->min_apply_delay > 0 &&
+		opts->streaming == LOGICALREP_STREAM_PARALLEL)
+		ereport(ERROR,
+				errcode(ERRCODE_SYNTAX_ERROR),
+
+		/*
+		 * translator: the first %s is a string of the form "parameter > 0"
+		 * and the second one is "option = value".
+		 */
+				errmsg("%s and %s are mutually exclusive options",
+					   "min_apply_delay > 0", "streaming = parallel"));
 }
 
 /*
@@ -560,7 +599,8 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 					  SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 					  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 					  SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT |
-					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN);
+					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN |
+					  SUBOPT_MIN_APPLY_DELAY);
 	parse_subscription_options(pstate, stmt->options, supported_opts, &opts);
 
 	/*
@@ -625,6 +665,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 	values[Anum_pg_subscription_oid - 1] = ObjectIdGetDatum(subid);
 	values[Anum_pg_subscription_subdbid - 1] = ObjectIdGetDatum(MyDatabaseId);
 	values[Anum_pg_subscription_subskiplsn - 1] = LSNGetDatum(InvalidXLogRecPtr);
+	values[Anum_pg_subscription_subminapplydelay - 1] = Int32GetDatum(opts.min_apply_delay);
 	values[Anum_pg_subscription_subname - 1] =
 		DirectFunctionCall1(namein, CStringGetDatum(stmt->subname));
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
@@ -1054,7 +1095,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 				supported_opts = (SUBOPT_SLOT_NAME |
 								  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 								  SUBOPT_STREAMING | SUBOPT_DISABLE_ON_ERR |
-								  SUBOPT_ORIGIN);
+								  SUBOPT_ORIGIN | SUBOPT_MIN_APPLY_DELAY);
 
 				parse_subscription_options(pstate, stmt->options,
 										   supported_opts, &opts);
@@ -1098,6 +1139,19 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.streaming == LOGICALREP_STREAM_PARALLEL &&
+						!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)
+						&& sub->minapplydelay > 0)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set parallel streaming mode for subscription with %s",
+									   "min_apply_delay"));
+
 					values[Anum_pg_subscription_substream - 1] =
 						CharGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -1111,6 +1165,26 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 						= true;
 				}
 
+				if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.min_apply_delay > 0 &&
+						!IsSet(opts.specified_opts, SUBOPT_STREAMING)
+						&& sub->stream == LOGICALREP_STREAM_PARALLEL)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set %s for subscription in parallel streaming mode",
+									   "min_apply_delay"));
+
+					values[Anum_pg_subscription_subminapplydelay - 1] =
+						Int32GetDatum(opts.min_apply_delay);
+					replaces[Anum_pg_subscription_subminapplydelay - 1] = true;
+				}
+
 				if (IsSet(opts.specified_opts, SUBOPT_ORIGIN))
 				{
 					values[Anum_pg_subscription_suborigin - 1] =
@@ -2195,3 +2269,45 @@ defGetStreamingMode(DefElem *def)
 					def->defname)));
 	return LOGICALREP_STREAM_OFF;	/* keep compiler quiet */
 }
+
+/*
+ * Extract the min_apply_delay value from a DefElem. This is very similar to
+ * parse_and_validate_value() for integer values, because min_apply_delay
+ * accepts the same parameter format as recovery_min_apply_delay.
+ */
+static int32
+defGetMinApplyDelay(DefElem *def)
+{
+	char	   *input_string;
+	int			result;
+	const char *hintmsg;
+
+	input_string = defGetString(def);
+
+	/*
+	 * Parse the given string as a parameter whose base unit is milliseconds
+	 */
+	if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid value for parameter \"%s\": \"%s\"",
+						"min_apply_delay", input_string),
+				 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+
+	/*
+	 * Check the lower bound of the valid min_apply_delay range, and also the
+	 * upper bound as a safeguard for platforms where int is wider than
+	 * int32. Although parse_int() has confirmed that the result is less
+	 * than or equal to INT_MAX, the value will be stored in a catalog
+	 * column of type int32.
+	 */
+	if (result < 0 || result > PG_INT32_MAX)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)",
+						result,
+						"min_apply_delay",
+						0, PG_INT32_MAX)));
+
+	return result;
+}
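
Note on defGetMinApplyDelay() above: it hands the string to parse_int() with GUC_UNIT_MS and then range-checks the result. As a rough, stand-alone illustration of that idea (not the patch's code), the sketch below turns a value such as "1s" or "2 min" into milliseconds and rejects anything outside 0 .. INT32_MAX. The helper name parse_delay_ms and its reduced unit list (ms, s, min, h, d) are assumptions made for this example only; the real parse_int() accepts more input forms and reports a hint message.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper for illustration; this is NOT PostgreSQL's parse_int().
 * Parses "<integer>[ unit]" into milliseconds. A missing unit means ms.
 * Returns false on syntax errors or if the result is outside 0 .. INT32_MAX.
 */
bool
parse_delay_ms(const char *input, int32_t *result_ms)
{
    char       *end;
    long long   value;
    long long   multiplier;

    errno = 0;
    value = strtoll(input, &end, 10);
    if (end == input || errno == ERANGE)
        return false;           /* no digits, or out of range for long long */

    /* Allow whitespace between the number and the unit */
    while (isspace((unsigned char) *end))
        end++;

    if (*end == '\0' || strcmp(end, "ms") == 0)
        multiplier = 1;
    else if (strcmp(end, "s") == 0)
        multiplier = 1000;
    else if (strcmp(end, "min") == 0)
        multiplier = 60 * 1000;
    else if (strcmp(end, "h") == 0)
        multiplier = 60 * 60 * 1000;
    else if (strcmp(end, "d") == 0)
        multiplier = 24LL * 60 * 60 * 1000;
    else
        return false;           /* unknown unit */

    /* Check the range before multiplying so the product cannot overflow */
    if (value < 0 || value > INT32_MAX / multiplier)
        return false;

    *result_ms = (int32_t) (value * multiplier);
    return true;
}
```

Doing the range check before the multiplication keeps the arithmetic within bounds; the patch instead relies on parse_int() for overflow detection and only re-checks the final value against PG_INT32_MAX.
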
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index da437e0bc3..32db20fd98 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -704,7 +704,8 @@ pa_process_spooled_messages_if_required(void)
 	{
 		apply_spooled_messages(&MyParallelShared->fileset,
 							   MyParallelShared->xid,
-							   InvalidXLogRecPtr);
+							   InvalidXLogRecPtr,
+							   0);
 		pa_set_fileset_state(MyParallelShared, FS_EMPTY);
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfb2ab6248..c574531040 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -319,6 +319,17 @@ static List *on_commit_wakeup_workers_subids = NIL;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/*
+ * To avoid a walsender timeout during time-delayed logical replication, the
+ * apply worker keeps sending feedback messages during the delay period.
+ * Because the delay is applied before the transaction starts, no WAL records
+ * are written for the suspended changes during the wait. Therefore, when the
+ * apply worker sends a feedback message during the delay, it must not
+ * overwrite the flushed and applied LSN positions with the latest received
+ * LSN. See send_feedback() for details.
+ */
+static XLogRecPtr last_received = InvalidXLogRecPtr;
+
 /* fields valid only when processing streamed transaction */
 static bool in_streamed_transaction = false;
 
@@ -389,7 +400,8 @@ static void stream_write_change(char action, StringInfo s);
 static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
 static void stream_close_file(void);
 
-static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
+static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply,
+						  bool has_unprocessed_change);
 
 static void DisableSubscriptionAndExit(void);
 
@@ -999,6 +1011,109 @@ slot_modify_data(TupleTableSlot *slot, TupleTableSlot *srcslot,
 	ExecStoreVirtualTuple(slot);
 }
 
+/*
+ * When min_apply_delay parameter is set on the subscriber, we wait long enough
+ * to make sure a transaction is applied at least that period behind the
+ * publisher.
+ *
+ * While physical replication applies the delay at commit time, this feature
+ * applies the delay before starting to apply the transaction. This is mainly
+ * because keeping a transaction that has performed write operations open for
+ * a long time causes problems such as bloat and locks being held for a long
+ * time.
+ *
+ * The min_apply_delay parameter will take effect only after all tables are in
+ * READY state.
+ *
+ * xid is the transaction id where we apply the delay.
+ *
+ * finish_ts is the commit/prepare time of both regular (non-streamed) and
+ * streamed transactions. Unlike the regular (non-streamed) cases, the delay
+ * is applied in a STREAM COMMIT/STREAM PREPARE message for streamed
+ * transactions. The STREAM START message does not contain a commit/prepare
+ * time (it will be available when the in-progress transaction finishes).
+ * Hence, it's not appropriate to apply a delay at the STREAM START time.
+ */
+static void
+maybe_apply_delay(TransactionId xid, TimestampTz finish_ts)
+{
+	Assert(finish_ts > 0);
+
+	/* Nothing to do if no delay set */
+	if (!MySubscription->minapplydelay)
+		return;
+
+	/*
+	 * The min_apply_delay parameter is ignored until all tablesync workers
+	 * have reached READY state. This is because if we allowed the delay
+	 * during the catchup phase, then once we reached the limit of tablesync
+	 * workers it would impose a delay for each subsequent worker. That would
+	 * cause initial table synchronization completion to take a long time.
+	 */
+	if (!AllTablesyncsReady())
+		return;
+
+	/* Apply the delay using the latch mechanism */
+	while (true)
+	{
+		TimestampTz delayUntil;
+		long		diffms;
+
+		ResetLatch(MyLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* This might change wal_receiver_status_interval */
+		if (ConfigReloadPending)
+		{
+			ConfigReloadPending = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		/*
+		 * Before calculating the time duration, reload the catalog if needed.
+		 */
+		if (!in_remote_transaction && !in_streamed_transaction)
+		{
+			AcceptInvalidationMessages();
+			maybe_reread_subscription();
+		}
+
+		delayUntil = TimestampTzPlusMilliseconds(finish_ts, MySubscription->minapplydelay);
+		diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
+
+		/*
+		 * Exit without arming the latch if it's already past time to apply
+		 * this transaction.
+		 */
+		if (diffms <= 0)
+			break;
+
+		elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = %d ms, remaining wait time: %ld ms",
+			 xid, MySubscription->minapplydelay, diffms);
+
+		/*
+		 * Call send_feedback() to prevent the publisher from timing out and
+		 * exiting during the delay, provided wal_receiver_status_interval is
+		 * enabled (greater than zero).
+		 */
+		if (wal_receiver_status_interval > 0 &&
+			diffms > wal_receiver_status_interval * 1000L)
+		{
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  wal_receiver_status_interval * 1000L,
+					  WAIT_EVENT_RECOVERY_APPLY_DELAY);
+			send_feedback(last_received, true, false, true);
+		}
+		else
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  diffms,
+					  WAIT_EVENT_RECOVERY_APPLY_DELAY);
+	}
+}
+
 /*
  * Handle BEGIN message.
  */
@@ -1013,6 +1128,9 @@ apply_handle_begin(StringInfo s)
 	logicalrep_read_begin(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
 
+	/* Should we delay the current transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.committime);
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	maybe_start_skipping_changes(begin_data.final_lsn);
@@ -1070,6 +1188,9 @@ apply_handle_begin_prepare(StringInfo s)
 	logicalrep_read_begin_prepare(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.prepare_lsn);
 
+	/* Should we delay the current prepared transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.prepare_time);
+
 	remote_final_lsn = begin_data.prepare_lsn;
 
 	maybe_start_skipping_changes(begin_data.prepare_lsn);
@@ -1317,7 +1438,8 @@ apply_handle_stream_prepare(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset,
-								   prepare_data.xid, prepare_data.prepare_lsn);
+								   prepare_data.xid, prepare_data.prepare_lsn,
+								   prepare_data.prepare_time);
 
 			/* Mark the transaction as prepared. */
 			apply_handle_prepare_internal(&prepare_data);
@@ -2011,10 +2133,13 @@ ensure_last_message(FileSet *stream_fileset, TransactionId xid, int fileno,
 
 /*
  * Common spoolfile processing.
+ *
+ * The commit/prepare time (finish_ts) is required for time-delayed logical
+ * replication.
  */
 void
 apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-					   XLogRecPtr lsn)
+					   XLogRecPtr lsn, TimestampTz finish_ts)
 {
 	StringInfoData s2;
 	int			nchanges;
@@ -2025,6 +2150,10 @@ apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
 	int			fileno;
 	off_t		offset;
 
+	/* Should we delay the current transaction? */
+	if (finish_ts)
+		maybe_apply_delay(xid, finish_ts);
+
 	if (!am_parallel_apply_worker())
 		maybe_start_skipping_changes(lsn);
 
@@ -2174,7 +2303,7 @@ apply_handle_stream_commit(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset, xid,
-								   commit_data.commit_lsn);
+								   commit_data.commit_lsn, commit_data.committime);
 
 			apply_handle_commit_internal(&commit_data);
 
@@ -3447,7 +3576,7 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
  * Apply main loop.
  */
 static void
-LogicalRepApplyLoop(XLogRecPtr last_received)
+LogicalRepApplyLoop(void)
 {
 	TimestampTz last_recv_timestamp = GetCurrentTimestamp();
 	bool		ping_sent = false;
@@ -3568,7 +3697,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						if (last_received < end_lsn)
 							last_received = end_lsn;
 
-						send_feedback(last_received, reply_requested, false);
+						send_feedback(last_received, reply_requested, false, false);
 						UpdateWorkerStats(last_received, timestamp, true);
 					}
 					/* other message types are purposefully ignored */
@@ -3581,7 +3710,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		}
 
 		/* confirm all writes so far */
-		send_feedback(last_received, false, false);
+		send_feedback(last_received, false, false, false);
 
 		if (!in_remote_transaction && !in_streamed_transaction)
 		{
@@ -3678,7 +3807,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 				}
 			}
 
-			send_feedback(last_received, requestReply, requestReply);
+			send_feedback(last_received, requestReply, requestReply, false);
 
 			/*
 			 * Force reporting to ensure long idle periods don't lead to
@@ -3708,7 +3837,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
  * to send a response to avoid timeouts.
  */
 static void
-send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
+send_feedback(XLogRecPtr recvpos, bool force, bool requestReply, bool has_unprocessed_change)
 {
 	static StringInfo reply_message = NULL;
 	static TimestampTz send_time = 0;
@@ -3738,8 +3867,14 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	/*
 	 * No outstanding transactions to flush, we can report the latest received
 	 * position. This is important for synchronous replication.
+	 *
+	 * If the subscription has unprocessed changes, do not tell the publisher
+	 * that the latest received LSN has already been applied and flushed;
+	 * otherwise, the publisher would make a wrong assumption about logical
+	 * replication progress. Instead, just send a feedback message to avoid a
+	 * replication timeout during the delay.
 	 */
-	if (!have_pending_txes)
+	if (!have_pending_txes && !has_unprocessed_change)
 		flushpos = writepos = recvpos;
 
 	if (writepos < last_writepos)
@@ -3776,8 +3911,9 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
 
-	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
+	elog(DEBUG2, "sending feedback (force %d, has_unprocessed_change %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
+		 has_unprocessed_change,
 		 LSN_FORMAT_ARGS(recvpos),
 		 LSN_FORMAT_ARGS(writepos),
 		 LSN_FORMAT_ARGS(flushpos));
@@ -4367,11 +4503,11 @@ start_table_sync(XLogRecPtr *origin_startpos, char **myslotname)
  * of system resource error and are not repeatable.
  */
 static void
-start_apply(XLogRecPtr origin_startpos)
+start_apply(void)
 {
 	PG_TRY();
 	{
-		LogicalRepApplyLoop(origin_startpos);
+		LogicalRepApplyLoop();
 	}
 	PG_CATCH();
 	{
@@ -4661,7 +4797,8 @@ ApplyWorkerMain(Datum main_arg)
 	}
 
 	/* Run the main loop. */
-	start_apply(origin_startpos);
+	last_received = origin_startpos;
+	start_apply();
 
 	proc_exit(0);
 }
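
Note on the wait loop in maybe_apply_delay() above: each iteration recomputes the remaining delay and either sleeps for the whole remainder or, when wal_receiver_status_interval is set and shorter than the remainder, sleeps for one interval and then sends feedback. The sketch below isolates just that decision as a pure function so it can be reasoned about without latches or timestamps. The names DelayStep and next_delay_step are assumptions made for this example, and times are plain millisecond counters rather than TimestampTz values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type for illustration only */
typedef struct DelayStep
{
    int64_t     sleep_ms;       /* how long to wait now; 0 means apply now */
    bool        send_feedback;  /* send a keepalive reply after this wait? */
} DelayStep;

/*
 * Decide the next wait for a transaction that finished at finish_ms on the
 * publisher, given the configured delay, the current time, and the status
 * interval in seconds (0 disables periodic feedback).
 */
DelayStep
next_delay_step(int64_t finish_ms, int32_t min_apply_delay_ms,
                int64_t now_ms, int status_interval_s)
{
    DelayStep   step = {0, false};
    int64_t     remaining = finish_ms + min_apply_delay_ms - now_ms;

    /* Already past the time to apply this transaction */
    if (remaining <= 0)
        return step;

    if (status_interval_s > 0 &&
        remaining > status_interval_s * 1000LL)
    {
        /* Sleep one status interval, then report in so the sender stays up */
        step.sleep_ms = status_interval_s * 1000LL;
        step.send_feedback = true;
    }
    else
        step.sleep_ms = remaining;  /* final (or only) wait */

    return step;
}
```

With a 30 s delay and a 10 s status interval, this yields two 10 s waits that are each followed by feedback, and then one final 10 s wait with no feedback, which matches the branch structure of the loop in the patch.
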
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 527c7651ab..1e87f0124e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4494,6 +4494,7 @@ getSubscriptions(Archive *fout)
 	int			i_subsynccommit;
 	int			i_subpublications;
 	int			i_subbinary;
+	int			i_subminapplydelay;
 	int			i,
 				ntups;
 
@@ -4546,9 +4547,13 @@ getSubscriptions(Archive *fout)
 						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	if (fout->remoteVersion >= 160000)
-		appendPQExpBufferStr(query, " s.suborigin\n");
+		appendPQExpBufferStr(query,
+							 " s.suborigin,\n"
+							 " s.subminapplydelay\n");
 	else
-		appendPQExpBuffer(query, " '%s' AS suborigin\n", LOGICALREP_ORIGIN_ANY);
+		appendPQExpBuffer(query, " '%s' AS suborigin,\n"
+						  " 0 AS subminapplydelay\n",
+						  LOGICALREP_ORIGIN_ANY);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4576,6 +4581,7 @@ getSubscriptions(Archive *fout)
 	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 	i_subdisableonerr = PQfnumber(res, "subdisableonerr");
 	i_suborigin = PQfnumber(res, "suborigin");
+	i_subminapplydelay = PQfnumber(res, "subminapplydelay");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4606,6 +4612,8 @@ getSubscriptions(Archive *fout)
 		subinfo[i].subdisableonerr =
 			pg_strdup(PQgetvalue(res, i, i_subdisableonerr));
 		subinfo[i].suborigin = pg_strdup(PQgetvalue(res, i, i_suborigin));
+		subinfo[i].subminapplydelay =
+			atoi(PQgetvalue(res, i, i_subminapplydelay));
 
 		/* Decide whether we want to dump it */
 		selectDumpableObject(&(subinfo[i].dobj), fout);
@@ -4687,6 +4695,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
+	if (subinfo->subminapplydelay > 0)
+		appendPQExpBuffer(query, ", min_apply_delay = '%d ms'", subinfo->subminapplydelay);
+
 	appendPQExpBufferStr(query, ");\n");
 
 	if (subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION)
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index e7cbd8d7ed..b8831c3ed3 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -661,6 +661,7 @@ typedef struct _SubscriptionInfo
 	char	   *subdisableonerr;
 	char	   *suborigin;
 	char	   *subsynccommit;
+	int			subminapplydelay;
 	char	   *subpublications;
 } SubscriptionInfo;
 
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index c8a0bb7b3a..81d4607a1c 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6472,7 +6472,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false, false, false, false, false};
+	false, false, false, false, false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6527,10 +6527,13 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Two-phase commit"),
 							  gettext_noop("Disable on error"));
 
+		/* Origin and min_apply_delay are only supported in v16 and higher */
 		if (pset.sversion >= 160000)
 			appendPQExpBuffer(&buf,
-							  ", suborigin AS \"%s\"\n",
-							  gettext_noop("Origin"));
+							  ", suborigin AS \"%s\"\n"
+							  ", subminapplydelay AS \"%s\"\n",
+							  gettext_noop("Origin"),
+							  gettext_noop("Min apply delay"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5e1882eaea..e8b9a43a47 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1925,7 +1925,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("(", "PUBLICATION");
 	/* ALTER SUBSCRIPTION <name> SET ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SET", "("))
-		COMPLETE_WITH("binary", "disable_on_error", "origin", "slot_name",
+		COMPLETE_WITH("binary", "disable_on_error", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit");
 	/* ALTER SUBSCRIPTION <name> SKIP ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SKIP", "("))
@@ -3268,7 +3268,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
-					  "disable_on_error", "enabled", "origin", "slot_name",
+					  "disable_on_error", "enabled", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index b0f2a1705d..d1cfefc6d6 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -74,6 +74,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	Oid			subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */
 
+	int32		subminapplydelay;	/* Replication apply delay (ms) */
+
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
@@ -122,6 +124,7 @@ typedef struct Subscription
 								 * skipped */
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
+	int32		minapplydelay;	/* Replication apply delay (ms) */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index dc87a4edd1..3dc09d1a4c 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -255,7 +255,7 @@ extern void stream_stop_internal(TransactionId xid);
 
 /* Common streaming function to apply all the spooled messages */
 extern void apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-								   XLogRecPtr lsn);
+								   XLogRecPtr lsn, TimestampTz finish_ts);
 
 extern void apply_dispatch(StringInfo s);
 
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 3f99b14394..cf8e727ee9 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -114,18 +114,18 @@ CREATE SUBSCRIPTION regress_testsub4 CONNECTION 'dbname=regress_doesnotexist' PU
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub4 SET (origin = any);
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub3;
@@ -143,10 +143,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -163,10 +163,10 @@ ERROR:  unrecognized subscription parameter: "create_slot"
 -- ok
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/12345');
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/12345
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/12345
 (1 row)
 
 -- ok - with lsn = NONE
@@ -175,10 +175,10 @@ ALTER SUBSCRIPTION regress_testsub SKIP (lsn = NONE);
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/0');
 ERROR:  invalid WAL location (LSN): 0/0
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/0
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 BEGIN;
@@ -210,10 +210,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                                               List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
----------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | local              | dbname=regress_doesnotexist2 | 0/0
+                                                                                                        List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | local              | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 -- rename back to keep the rest simple
@@ -247,19 +247,19 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -271,27 +271,27 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication already exists
@@ -306,10 +306,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                                                 List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication used more than once
@@ -324,10 +324,10 @@ ERROR:  publication "testpub3" is not in subscription "regress_testsub"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -363,10 +363,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 --fail - alter of two_phase option not supported.
@@ -375,10 +375,10 @@ ERROR:  unrecognized subscription parameter: "two_phase"
 -- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -388,10 +388,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -404,20 +404,57 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+ERROR:  invalid value for parameter "min_apply_delay": "foo"
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+ERROR:  -1 ms is outside the valid range for parameter "min_apply_delay" (0 .. 2147483647)
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+ERROR:  min_apply_delay > 0 and streaming = parallel are mutually exclusive options
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+WARNING:  subscription was created, but is not connected
+HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |             123 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |        86400000 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+ERROR:  cannot set parallel streaming mode for subscription with min_apply_delay
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ERROR:  cannot set min_apply_delay for subscription in parallel streaming mode
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 7281f5fee2..7317b140f5 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -286,6 +286,30 @@ ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+\dRs+
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 91aa068c95..f94819672b 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -515,6 +515,36 @@ $node_publisher->poll_query_until('postgres',
   or die
   "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
 
+# Test time-delayed logical replication
+#
+# If the subscription sets min_apply_delay parameter, the logical replication
+# worker will delay the transaction apply for min_apply_delay milliseconds. We
+# look the time duration between tuples are inserted on publisher and then
+# changes are replicated on subscriber.
+my $delay = 3;
+
+# Set min_apply_delay parameter to 3 seconds
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+
+# Make new content on publisher and check its presence in subscriber depending
+# on the delay applied above. Before doing the insertion, get the
+# current timestamp that will be used as a comparison base. Even on slow
+# machines, this gives predictable behavior when comparing the delay between
+# the data insertion time on publisher and the replay time on subscriber.
+my $publisher_insert_time = time();
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_ins VALUES (generate_series(1101, 1120))");
+
+# The publisher waits for the replication to complete
+$node_publisher->wait_for_catchup('tap_sub_renamed');
+
+# This test is successful if and only if the LSN has been applied with at least
+# the configured apply delay.
+ok( time() - $publisher_insert_time >= $delay,
+	"subscriber applies WAL only after replication delay for non-streaming transaction"
+);
+
 # check all the cleanup
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
 
-- 
2.27.0

v4-0002-Extend-START_REPLICATION-command-to-accept-walsen.patch (application/octet-stream)
From 6aa798c884ed8fdc56cdf3b6d1f0fe5b70a9ce33 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Tue, 7 Feb 2023 05:38:20 +0000
Subject: [PATCH v4 2/2] Extend START_REPLICATION command to accept walsender
 options

This commit extends START_REPLICATION to accept options of walsender. Currently,
only one option exit_before_confirming is accepted.

For physical replication, the grammar of START_REPLICATION is extended to accept
options. Note that in normal physical replication the added option is never
used.

For logical replication, the option list for logical decoding plugin is reused for
storing walsender options. When the min_apply_delay parameter is set for a
subscription, the apply worker related with it will send START_REPLICATION query
with exit_before_confirming = true to publisher node.

This option allows primary servers to shut down even if there is pending WAL to
be sent or sent WAL has not been flushed on the secondary. This may be useful to
shut down the primary even when the walreceiver/worker is stuck.

Author: Hayato Kuroda
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 doc/src/sgml/protocol.sgml                    | 18 ++++-
 .../libpqwalreceiver/libpqwalreceiver.c       |  7 ++
 src/backend/replication/logical/worker.c      | 13 ++-
 src/backend/replication/repl_gram.y           | 12 ++-
 src/backend/replication/repl_scanner.l        |  1 +
 src/backend/replication/walreceiver.c         |  1 +
 src/backend/replication/walsender.c           | 80 ++++++++++++++++++-
 src/include/nodes/replnodes.h                 |  1 +
 src/include/replication/walreceiver.h         |  1 +
 src/test/subscription/t/001_rep_changes.pl    | 10 ++-
 src/tools/pgindent/typedefs.list              |  2 +
 11 files changed, 138 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 93fc7167d4..2622b084ed 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2500,7 +2500,7 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
     </varlistentry>
 
     <varlistentry id="protocol-replication-start-replication-slot-logical">
-     <term><literal>START_REPLICATION</literal> <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> <literal>LOGICAL</literal> <replaceable class="parameter">XXX/XXX</replaceable> [ ( <replaceable>option_name</replaceable> [ <replaceable>option_value</replaceable> ] [, ...] ) ]</term>
+     <term><literal>START_REPLICATION</literal> <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> <literal>LOGICAL</literal> <replaceable class="parameter">XXX/XXX</replaceable> [ <literal>SHUTDOWN_MODE</literal> <replaceable class="parameter">shutdown_mode</replaceable> ] [ ( <replaceable>option_name</replaceable> [ <replaceable>option_value</replaceable> ] [, ...] ) ]</term>
      <listitem>
       <para>
        Instructs server to start streaming WAL for logical replication,
@@ -2555,6 +2555,22 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
         </listitem>
        </varlistentry>
 
+       <varlistentry>
+        <term><literal>SHUTDOWN_MODE { 'wait_flush' | 'immediate' }</literal></term>
+        <listitem>
+         <para>
+          Determines the condition under which the walsender process exits.
+          With <literal>'wait_flush'</literal>, which is the default, the
+          walsender waits for all sent WAL to be flushed on the subscriber
+          side before exiting. With <literal>'immediate'</literal>, the
+          walsender exits without confirming the remote flush. This may break
+          the consistency between publisher and subscriber, but it may be
+          useful for reducing shutdown time on a system that has a
+          high-latency network.
+         </para>
+        </listitem>
+       </varlistentry>
+
        <varlistentry>
         <term><replaceable class="parameter">option_name</replaceable></term>
         <listitem>
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 560ec974fa..18f6e09cfd 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -403,6 +403,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		List	   *pubnames;
 		char	   *pubnames_literal;
 
+		/* Add SHUTDOWN_MODE option if needed */
+		if (options->shutdown_mode &&
+			PQserverVersion(conn->streamConn) >= 160000)
+			appendStringInfo(&cmd, " SHUTDOWN_MODE '%s'",
+							 options->shutdown_mode);
+
 		appendStringInfoString(&cmd, " (");
 
 		appendStringInfo(&cmd, "proto_version '%u'",
@@ -449,6 +455,7 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, " TIMELINE %u",
 						 options->proto.physical.startpointTLI);
 
+
 	/* Start streaming. */
 	res = libpqrcv_PQexec(conn->streamConn, cmd.data);
 	pfree(cmd.data);
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c574531040..feffccfd47 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4034,7 +4034,9 @@ maybe_reread_subscription(void)
 		newsub->stream != MySubscription->stream ||
 		strcmp(newsub->origin, MySubscription->origin) != 0 ||
 		newsub->owner != MySubscription->owner ||
-		!equal(newsub->publications, MySubscription->publications))
+		!equal(newsub->publications, MySubscription->publications) ||
+		/* minapplydelay affects SHUTDOWN_MODE option */
+		(newsub->minapplydelay == 0) != (MySubscription->minapplydelay == 0))
 	{
 		if (am_parallel_apply_worker())
 			ereport(LOG,
@@ -4718,6 +4720,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	options.shutdown_mode = NULL;
 
 	server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
 	options.proto.logical.proto_version =
@@ -4756,6 +4759,14 @@ ApplyWorkerMain(Datum main_arg)
 
 	if (!am_tablesync_worker())
 	{
+		/*
+		 * time-delayed logical replication does not support tablesync
+		 * workers, so only the leader apply worker can request walsenders to
+		 * exit before confirming remote flush.
+		 */
+		if (server_version >= 160000 && MySubscription->minapplydelay > 0)
+			options.shutdown_mode = pstrdup("immediate");
+
 		/*
 		 * Even when the two_phase mode is requested by the user, it remains
 		 * as the tri-state PENDING until all tablesyncs have reached READY
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..54450a041a 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,6 +76,7 @@ Node *replication_parse_result;
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
+%token K_SHUTDOWN_MODE
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -91,6 +92,7 @@ Node *replication_parse_result;
 %type <boolval>	opt_temporary
 %type <list>	create_slot_options create_slot_legacy_opt_list
 %type <defelt>	create_slot_legacy_opt
+%type <str>	opt_shutdown_mode
 
 %%
 
@@ -270,20 +272,22 @@ start_replication:
 					cmd->slotname = $2;
 					cmd->startpoint = $4;
 					cmd->timeline = $5;
+					cmd->shutdownmode = NULL;
 					$$ = (Node *) cmd;
 				}
 			;
 
 /* START_REPLICATION SLOT slot LOGICAL %X/%X options */
 start_logical_replication:
-			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options
+			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR opt_shutdown_mode plugin_options
 				{
 					StartReplicationCmd *cmd;
 					cmd = makeNode(StartReplicationCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $3;
 					cmd->startpoint = $5;
-					cmd->options = $6;
+					cmd->shutdownmode = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -336,6 +340,10 @@ opt_timeline:
 				| /* EMPTY */			{ $$ = 0; }
 			;
 
+opt_shutdown_mode:
+			K_SHUTDOWN_MODE SCONST			{ $$ = $2; }
+			| /* EMPTY */					{ $$ = NULL; }
+		;
 
 plugin_options:
 			'(' plugin_opt_list ')'			{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index cb467ca46f..fcc6f6feda 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
 WAIT				{ return K_WAIT; }
+SHUTDOWN_MODE		{ return K_SHUTDOWN_MODE; }
 
 {space}+		{ /* do nothing */ }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f6446da2d6..cfce9d93ef 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -409,6 +409,7 @@ WalReceiverMain(void)
 		options.logical = false;
 		options.startpoint = startpoint;
 		options.slotname = slotname[0] != '\0' ? slotname : NULL;
+		options.shutdown_mode = NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		if (walrcv_startstreaming(wrconn, &options))
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4ed3747e3f..65d08bdc95 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -219,6 +219,25 @@ typedef struct
 
 static LagTracker *lag_tracker;
 
+/* Indicator for specifying the shutdown mode */
+typedef enum
+{
+	WALSND_SHUTDOWN_MODE_WAIT_FLUSH = 0,
+	WALSND_SHUTDOWN_MODE_IMMIDEATE
+} WalSndShutdownMode;
+
+/*
+ * Options for controlling the behavior of the walsender. Options can be
+ * specified in the START_REPLICATION command. Currently only one
+ * option is allowed.
+ */
+typedef struct
+{
+	WalSndShutdownMode shutdown_mode;
+} WalSndOptions;
+
+static WalSndOptions *my_options = NULL;
+
 /* Signal handlers */
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
@@ -260,6 +279,8 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
 							  TimeLineID *tli_p);
 
+static void CheckWalSndOptions(const StartReplicationCmd *cmd);
+static void ParseShutdownMode(char *shutdownmode);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -1272,6 +1293,12 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 		got_STOPPING = true;
 	}
 
+	/* Initialize an option holder */
+	my_options = (WalSndOptions *) palloc0(sizeof(WalSndOptions));
+
+	/* Check given options and set value to the holder */
+	CheckWalSndOptions(cmd);
+
 	/*
 	 * Create our decoding context, making it start at the previously ack'ed
 	 * position.
@@ -1450,6 +1477,16 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		/*
+		 * The walsender can get stuck in this function: if the receiving
+		 * worker is stuck, the send buffer of the walsender eventually fills
+		 * up. Therefore, an additional exit path is needed for the immediate
+		 * shutdown mode.
+		 */
+		if (my_options->shutdown_mode == WALSND_SHUTDOWN_MODE_IMMIDEATE &&
+			got_STOPPING)
+			WalSndDone(XLogSendLogical);
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
@@ -3114,19 +3151,26 @@ WalSndDone(WalSndSendDataCallback send_data)
 	 * To figure out whether all WAL has successfully been replicated, check
 	 * flush location if valid, write otherwise. Tools like pg_receivewal will
 	 * usually (unless in synchronous mode) return an invalid flush location.
+	 *
+	 * If we are in the immediate shutdown mode, the flush location and the
+	 * output buffer are not checked. This may break the consistency between
+	 * nodes, but it may be useful for a system that has a high-latency
+	 * network to reduce the amount of time for shutdown.
 	 */
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	if (WalSndCaughtUp &&
+		((my_options &&
+		  my_options->shutdown_mode == WALSND_SHUTDOWN_MODE_IMMIDEATE) ||
+		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
 	{
 		QueryCompletion qc;
 
 		/* Inform the standby that XLOG streaming is done */
 		SetQueryCompletion(&qc, CMDTAG_COPY, 0);
 		EndCommand(&qc, DestRemote, false);
-		pq_flush();
+		pq_flush_if_writable();
 
 		proc_exit(0);
 	}
@@ -3849,3 +3893,33 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+/*
+ * Check options for walsender itself and set a value to an option holder.
+ *
+ * Currently only one option is accepted.
+ */
+static void
+CheckWalSndOptions(const StartReplicationCmd *cmd)
+{
+	if (cmd->shutdownmode)
+		ParseShutdownMode(cmd->shutdownmode);
+}
+
+/*
+ * Parse given shutdown mode.
+ *
+ * Currently two values are accepted - "wait_flush" and "immediate"
+ */
+static void
+ParseShutdownMode(char *shutdownmode)
+{
+	if (pg_strcasecmp(shutdownmode, "wait_flush") == 0)
+		my_options->shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+	else if (pg_strcasecmp(shutdownmode, "immediate") == 0)
+		my_options->shutdown_mode = WALSND_SHUTDOWN_MODE_IMMIDEATE;
+	else
+		ereport(ERROR,
+				errcode(ERRCODE_SYNTAX_ERROR),
+				errmsg("SHUTDOWN_MODE requires \"wait_flush\" or \"immediate\""));
+}
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..c96e85e859 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -83,6 +83,7 @@ typedef struct StartReplicationCmd
 	char	   *slotname;
 	TimeLineID	timeline;
 	XLogRecPtr	startpoint;
+	char	   *shutdownmode;
 	List	   *options;
 } StartReplicationCmd;
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index decffe352d..ef6297da52 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -170,6 +170,7 @@ typedef struct
 								 * false if physical stream.  */
 	char	   *slotname;		/* Name of the replication slot or NULL. */
 	XLogRecPtr	startpoint;		/* LSN of starting point. */
+	char	   *shutdown_mode;	/* Name of specified shutdown mode */
 
 	union
 	{
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index f94819672b..d7a6fd0e38 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -523,9 +523,17 @@ $node_publisher->poll_query_until('postgres',
 # changes are replicated on subscriber.
 my $delay = 3;
 
-# Set min_apply_delay parameter to 3 seconds
+# check restart on changing min_apply_delay to 3 seconds
+$oldpid = $node_publisher->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+);
 $node_subscriber->safe_psql('postgres',
 	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+$node_publisher->poll_query_until('postgres',
+	"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+  )
+  or die
+  "Timed out while waiting for apply to restart after changing min_apply_delay to non-zero value";
 
 # Make new content on publisher and check its presence in subscriber depending
 # on the delay applied above. Before doing the insertion, get the
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d3224dfc36..837cbf8d9d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2969,7 +2969,9 @@ WalReceiverConn
 WalReceiverFunctionsType
 WalSnd
 WalSndCtlData
+WalSndOptions
 WalSndSendDataCallback
+WalSndShutdownMode
 WalSndState
 WalTimeSample
 WalUsage
-- 
2.27.0

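The exit condition that the v4-0002 patch above adds to WalSndDone() can be
read as a small standalone predicate. The following is only a sketch for
discussion, not the patch code: the names SketchLsn, SketchShutdownMode and
sketch_walsender_may_exit are hypothetical, the LSN type is a plain integer
stand-in for XLogRecPtr, and an invalid flush location is assumed to be
represented by 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr; 0 is assumed to mean "invalid". */
typedef uint64_t SketchLsn;

/* Hypothetical mirror of the two modes accepted by SHUTDOWN_MODE. */
typedef enum
{
    SKETCH_SHUTDOWN_WAIT_FLUSH = 0,
    SKETCH_SHUTDOWN_IMMEDIATE
} SketchShutdownMode;

/*
 * Return true if a stopping walsender may exit now.
 *
 * In both modes the walsender must have caught up with sending. In
 * wait_flush mode it must additionally see that the receiver has replicated
 * everything that was sent and that no output is pending. In immediate mode
 * those two checks are skipped.
 */
static bool
sketch_walsender_may_exit(bool caught_up, SketchShutdownMode mode,
                          SketchLsn sent, SketchLsn flush, SketchLsn write,
                          bool send_pending)
{
    /* Use the flush location if valid, the write location otherwise. */
    SketchLsn replicated = (flush == 0) ? write : flush;

    assert(mode == SKETCH_SHUTDOWN_WAIT_FLUSH ||
           mode == SKETCH_SHUTDOWN_IMMEDIATE);

    if (!caught_up)
        return false;
    if (mode == SKETCH_SHUTDOWN_IMMEDIATE)
        return true;
    return sent == replicated && !send_pending;
}
```

With this reading, a subscriber that holds back its flush reports (for
example because of min_apply_delay) blocks shutdown only in wait_flush mode.
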
#47Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#46)
2 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san,

Thank you for checking the patch! PSA new version.

PSA the rebased patch, which supports the updated time-delayed patch [1].

[1]: /messages/by-id/TYAPR01MB5866C11DAF8AB04F3CC181D3F5D89@TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v5-0001-Time-delayed-logical-replication-subscriber.patch (application/octet-stream)
From fb6d0284f2ffadcdefc871eefde8326f7a6203e0 Mon Sep 17 00:00:00 2001
From: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Date: Tue, 7 Feb 2023 13:05:34 +0000
Subject: [PATCH v5 1/2] Time-delayed logical replication subscriber

Similar to physical replication, a time-delayed copy of the data for
logical replication is useful for some scenarios (particularly to fix
errors that might cause data loss).

This patch implements a new subscription parameter called 'min_apply_delay'.

If the subscription sets min_apply_delay parameter, the logical
replication worker will delay the transaction apply for min_apply_delay
milliseconds.

The delay is calculated between the WAL time stamp and the current time
on the subscriber.
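
As a rough illustration of the rule above (a sketch only, not the code in
worker.c): the remaining wait is the configured delay minus the time that
has already passed since the WAL timestamp, and no wait happens once that
difference is used up. The function name sketch_remaining_delay_ms and the
use of plain millisecond integers instead of TimestampTz are assumptions
made for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return how many milliseconds the apply worker would still have to wait
 * before applying a transaction, or 0 if it can be applied right away.
 *
 * commit_ts_ms is the WAL timestamp written on the publisher and now_ms is
 * the current time on the subscriber, both in milliseconds on the same
 * (hypothetical) epoch.
 */
static int64_t
sketch_remaining_delay_ms(int64_t commit_ts_ms, int64_t now_ms,
                          int64_t min_apply_delay_ms)
{
    int64_t elapsed_ms;
    int64_t remaining_ms;

    /* min_apply_delay must be non-negative. */
    assert(min_apply_delay_ms >= 0);

    /* Time already spent in decoding, transfer, and any clock difference. */
    elapsed_ms = now_ms - commit_ts_ms;

    remaining_ms = min_apply_delay_ms - elapsed_ms;
    return remaining_ms > 0 ? remaining_ms : 0;
}
```

Because the elapsed time is measured across two machines, a clock difference
between publisher and subscriber shifts the result by the same amount.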

The delay occurs before we start to apply the transaction on the
subscriber. The main reason is to avoid keeping a transaction open for
a long time. Regular and prepared transactions are covered. Streamed
transactions are also covered.

The combination of parallel streaming mode and min_apply_delay is not
allowed. This is because in parallel streaming mode, we start applying
the transaction stream as soon as the first change arrives without
knowing the transaction's prepare/commit time. This means we cannot
calculate the underlying network/decoding lag between publisher and
subscriber, and so always waiting for the full 'min_apply_delay' period
might include unnecessary delay.

The other possibility was to apply the delay at the end of the parallel
apply transaction but that would cause issues related to resource
bloat and locks being held for a long time.

Note that this feature doesn't interact with skip transaction feature.
The skip transaction feature applies to one transaction with a specific LSN.
So, even if the skipped transaction and non-skipped transaction come
consecutively in a very short time, regardless of the order of which comes
first, the time-delayed feature gets balanced by delayed application
for other transactions before and after the skipped transaction.

Author: Euler Taveira, Takamichi Osumi, Kuroda Hayato
Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Shveta Malik,
             Kyotaro Horiguchi, Shi Yu, Wang Wei, Dilip Kumar, Melih Mutlu
Discussion: https://postgr.es/m/CAB-JLwYOYwL=XTyAXKiH5CtM_Vm8KjKh7aaitCKvmCh4rzr5pQ@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                    |   9 +
 doc/src/sgml/config.sgml                      |  12 ++
 doc/src/sgml/glossary.sgml                    |  15 ++
 doc/src/sgml/logical-replication.sgml         |   6 +
 doc/src/sgml/ref/alter_subscription.sgml      |   5 +-
 doc/src/sgml/ref/create_subscription.sgml     |  49 ++++-
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/catalog/system_views.sql          |   7 +-
 src/backend/commands/subscriptioncmds.c       | 122 +++++++++++-
 .../replication/logical/applyparallelworker.c |   3 +-
 src/backend/replication/logical/worker.c      | 171 +++++++++++++++--
 src/bin/pg_dump/pg_dump.c                     |  15 +-
 src/bin/pg_dump/pg_dump.h                     |   1 +
 src/bin/psql/describe.c                       |   9 +-
 src/bin/psql/tab-complete.c                   |   4 +-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/replication/worker_internal.h     |   2 +-
 src/test/regress/expected/subscription.out    | 181 +++++++++++-------
 src/test/regress/sql/subscription.sql         |  24 +++
 src/test/subscription/t/001_rep_changes.pl    |  28 +++
 20 files changed, 563 insertions(+), 104 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048054..5dc5ca1133 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7873,6 +7873,15 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subminapplydelay</structfield> <type>int4</type>
+      </para>
+      <para>
+       The minimum delay, in milliseconds, for applying changes
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subname</structfield> <type>name</type>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d190be1925..626a8b5bd0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4787,6 +4787,18 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
        the <filename>postgresql.conf</filename> file or on the server
        command line.
       </para>
+      <para>
+       For time-delayed logical replication, the apply worker sends a feedback
+       message to the publisher every
+       <varname>wal_receiver_status_interval</varname> milliseconds. Make sure
+       to set <varname>wal_receiver_status_interval</varname> less than the
+       <varname>wal_sender_timeout</varname> on the publisher, otherwise, the
+       <literal>walsender</literal> will repeatedly terminate due to timeout
+       errors. Note that if <varname>wal_receiver_status_interval</varname> is
+       set to zero, the apply worker sends no feedback messages during the
+       <literal>min_apply_delay</literal> period. Refer to
+       <xref linkend="sql-createsubscription"/> for more information.
+      </para>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 7c01a541fe..9ede9d05f6 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1729,6 +1729,21 @@
    </glossdef>
   </glossentry>
 
+  <glossentry id="glossary-time-delayed-replication">
+   <glossterm>Time-delayed replication</glossterm>
+   <glossdef>
+    <para>
+     Replication setup that delays the application of changes by a specified
+     minimum time-delay period.
+    </para>
+    <para>
+     For more information, see
+     <xref linkend="guc-recovery-min-apply-delay"/> for physical replication
+     and <xref linkend="sql-createsubscription"/> for logical replication.
+    </para>
+   </glossdef>
+  </glossentry>
+
   <glossentry id="glossary-toast">
    <glossterm>TOAST</glossterm>
    <glossdef>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 1bd5660c87..6bd5f61e2b 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -247,6 +247,12 @@
    target table.
   </para>
 
+  <para>
+   A subscription can delay the application of changes by specifying the
+   <literal>min_apply_delay</literal> subscription parameter. See
+   <xref linkend="sql-createsubscription"/> for details.
+  </para>
+
   <sect2 id="logical-replication-subscription-slot">
    <title>Replication Slot Management</title>
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 964fcbb8ff..8b7eb28e54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -213,8 +213,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
       <literal>binary</literal>, <literal>streaming</literal>,
-      <literal>disable_on_error</literal>, and
-      <literal>origin</literal>.
+      <literal>disable_on_error</literal>,
+      <literal>origin</literal>, and
+      <literal>min_apply_delay</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 51c45f17c7..1b4b8390af 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -349,7 +349,49 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
-      </variablelist></para>
+
+       <varlistentry>
+        <term><literal>min_apply_delay</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          By default, the subscriber applies changes as soon as possible. This
+          parameter allows the user to delay the application of changes by a
+          given time period. If the value is specified without units, it is
+          taken as milliseconds. The default is zero (no delay). See
+          <xref linkend="config-setting-names-values"/> for details on the
+          available valid time units.
+         </para>
+         <para>
+          Any delay becomes effective only after all initial table
+          synchronization has finished and occurs before each transaction starts
+          to get applied on the subscriber. The delay is calculated as the
+          difference between the WAL timestamp as written on the publisher and
+          the current time on the subscriber. Any overhead of time spent in
+          logical decoding and in transferring the transaction may reduce the
+          actual wait time. It is also possible that the overhead already
+          exceeds the requested <literal>min_apply_delay</literal> value, in
+          which case no delay is applied. If the system clocks on publisher and
+          subscriber are not synchronized, changes may be applied earlier
+          than expected, but this is not a major issue because this
+          parameter is typically much larger than the time deviations between
+          servers. Note that if this parameter is set to a long delay, the
+          replication will stop if the replication slot falls behind the current
+          LSN by more than
+          <link linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</literal></link>.
+         </para>
+         <warning>
+           <para>
+            Delaying the replication means there is a much longer time between
+            making a change on the publisher, and that change being committed
+            on the subscriber. This can impact the performance of synchronous
+            replication. See <xref linkend="guc-synchronous-commit"/>
+            parameter.
+           </para>
+         </warning>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
 
     </listitem>
    </varlistentry>
@@ -420,6 +462,11 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
    published with different column lists are not supported.
   </para>
 
+  <para>
+   A non-zero <literal>min_apply_delay</literal> parameter is not allowed when
+   streaming in parallel mode.
+  </para>
+
   <para>
    We allow non-existent publications to be specified so that users can add
    those later. This means
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index a56ae311c3..e19e5cbca2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->skiplsn = subform->subskiplsn;
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
+	sub->minapplydelay = subform->subminapplydelay;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..317c2010cb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1299,9 +1299,10 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled,
-              subbinary, substream, subtwophasestate, subdisableonerr,
-              subslotname, subsynccommit, subpublications, suborigin)
+GRANT SELECT (oid, subdbid, subskiplsn, subminapplydelay, subname, subowner,
+              subenabled, subbinary, substream, subtwophasestate,
+              subdisableonerr, subslotname, subsynccommit, subpublications,
+              suborigin)
     ON pg_subscription TO public;
 
 CREATE VIEW pg_stat_subscription_stats AS
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 464db6d247..82e16fd0f9 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -66,6 +66,7 @@
 #define SUBOPT_DISABLE_ON_ERR		0x00000400
 #define SUBOPT_LSN					0x00000800
 #define SUBOPT_ORIGIN				0x00001000
+#define SUBOPT_MIN_APPLY_DELAY		0x00002000
 
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
@@ -90,6 +91,7 @@ typedef struct SubOpts
 	bool		disableonerr;
 	char	   *origin;
 	XLogRecPtr	lsn;
+	int32		min_apply_delay;
 } SubOpts;
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
@@ -100,7 +102,7 @@ static void check_publications_origin(WalReceiverConn *wrconn,
 static void check_duplicates_in_publist(List *publist, Datum *datums);
 static List *merge_publications(List *oldpublist, List *newpublist, bool addpub, const char *subname);
 static void ReportSlotConnectionError(List *rstates, Oid subid, char *slotname, char *err);
-
+static int32 defGetMinApplyDelay(DefElem *def);
 
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
@@ -146,6 +148,8 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 		opts->disableonerr = false;
 	if (IsSet(supported_opts, SUBOPT_ORIGIN))
 		opts->origin = pstrdup(LOGICALREP_ORIGIN_ANY);
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY))
+		opts->min_apply_delay = 0;
 
 	/* Parse options */
 	foreach(lc, stmt_options)
@@ -324,6 +328,15 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 			opts->specified_opts |= SUBOPT_LSN;
 			opts->lsn = lsn;
 		}
+		else if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+				 strcmp(defel->defname, "min_apply_delay") == 0)
+		{
+			if (IsSet(opts->specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				errorConflictingDefElem(defel, pstate);
+
+			opts->specified_opts |= SUBOPT_MIN_APPLY_DELAY;
+			opts->min_apply_delay = defGetMinApplyDelay(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -404,6 +417,32 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 								"slot_name = NONE", "create_slot = false")));
 		}
 	}
+
+	/*
+	 * The combination of parallel streaming mode and min_apply_delay is not
+	 * allowed. This is because in parallel streaming mode, we start applying
+	 * the transaction stream as soon as the first change arrives without
+	 * knowing the transaction's prepare/commit time. This means we cannot
+	 * calculate the underlying network/decoding lag between publisher and
+	 * subscriber, and so always waiting for the full 'min_apply_delay' period
+	 * might include unnecessary delay.
+	 *
+	 * The other possibility was to apply the delay at the end of the parallel
+	 * apply transaction but that would cause issues related to resource bloat
+	 * and locks being held for a long time.
+	 */
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+		opts->min_apply_delay > 0 &&
+		opts->streaming == LOGICALREP_STREAM_PARALLEL)
+		ereport(ERROR,
+				errcode(ERRCODE_SYNTAX_ERROR),
+
+		/*
+		 * translator: the first %s is a string of the form "parameter > 0"
+		 * and the second one is "option = value".
+		 */
+				errmsg("%s and %s are mutually exclusive options",
+					   "min_apply_delay > 0", "streaming = parallel"));
 }
 
 /*
@@ -560,7 +599,8 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 					  SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 					  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 					  SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT |
-					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN);
+					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN |
+					  SUBOPT_MIN_APPLY_DELAY);
 	parse_subscription_options(pstate, stmt->options, supported_opts, &opts);
 
 	/*
@@ -625,6 +665,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 	values[Anum_pg_subscription_oid - 1] = ObjectIdGetDatum(subid);
 	values[Anum_pg_subscription_subdbid - 1] = ObjectIdGetDatum(MyDatabaseId);
 	values[Anum_pg_subscription_subskiplsn - 1] = LSNGetDatum(InvalidXLogRecPtr);
+	values[Anum_pg_subscription_subminapplydelay - 1] = Int32GetDatum(opts.min_apply_delay);
 	values[Anum_pg_subscription_subname - 1] =
 		DirectFunctionCall1(namein, CStringGetDatum(stmt->subname));
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
@@ -1054,7 +1095,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 				supported_opts = (SUBOPT_SLOT_NAME |
 								  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 								  SUBOPT_STREAMING | SUBOPT_DISABLE_ON_ERR |
-								  SUBOPT_ORIGIN);
+								  SUBOPT_ORIGIN | SUBOPT_MIN_APPLY_DELAY);
 
 				parse_subscription_options(pstate, stmt->options,
 										   supported_opts, &opts);
@@ -1098,6 +1139,19 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.streaming == LOGICALREP_STREAM_PARALLEL &&
+						!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)
+						&& sub->minapplydelay > 0)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set parallel streaming mode for subscription with %s",
+									   "min_apply_delay"));
+
 					values[Anum_pg_subscription_substream - 1] =
 						CharGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -1111,6 +1165,26 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 						= true;
 				}
 
+				if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.min_apply_delay > 0 &&
+						!IsSet(opts.specified_opts, SUBOPT_STREAMING)
+						&& sub->stream == LOGICALREP_STREAM_PARALLEL)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set %s for subscription in parallel streaming mode",
+									   "min_apply_delay"));
+
+					values[Anum_pg_subscription_subminapplydelay - 1] =
+						Int32GetDatum(opts.min_apply_delay);
+					replaces[Anum_pg_subscription_subminapplydelay - 1] = true;
+				}
+
 				if (IsSet(opts.specified_opts, SUBOPT_ORIGIN))
 				{
 					values[Anum_pg_subscription_suborigin - 1] =
@@ -2195,3 +2269,45 @@ defGetStreamingMode(DefElem *def)
 					def->defname)));
 	return LOGICALREP_STREAM_OFF;	/* keep compiler quiet */
 }
+
+/*
+ * Extract the min_apply_delay value from a DefElem. This is very similar to
+ * parse_and_validate_value() for integer values, because min_apply_delay
+ * accepts the same parameter format as recovery_min_apply_delay.
+ */
+static int32
+defGetMinApplyDelay(DefElem *def)
+{
+	char	   *input_string;
+	int			result;
+	const char *hintmsg;
+
+	input_string = defGetString(def);
+
+	/*
+	 * Parse the given string as a parameter whose base unit is milliseconds.
+	 */
+	if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid value for parameter \"%s\": \"%s\"",
+						"min_apply_delay", input_string),
+				 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+
+	/*
+	 * Check the lower bound of the valid min_apply_delay range, and also the
+	 * upper bound as a safeguard for platforms where int is wider than int32.
+	 * Although parse_int() has confirmed that the result is less than or
+	 * equal to INT_MAX, the value will be stored in an int32 catalog
+	 * column.
+	 */
+	if (result < 0 || result > PG_INT32_MAX)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)",
+						result,
+						"min_apply_delay",
+						0, PG_INT32_MAX)));
+
+	return result;
+}
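
A note on the three mutual-exclusion checks in this file: they appear to amount to one rule, namely "reject the command if the resulting subscription would have min_apply_delay > 0 together with streaming = parallel". The following is only a simplified standalone model of that reasoning, not code from the patch. StreamMode, AlterOpts and all three function names are hypothetical. The model assumes that an option not given in the command falls back to a parse-time default (streaming off, delay 0) and that the stored subscription is not already in the conflicting state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the LOGICALREP_STREAM_* values. */
typedef enum { STREAM_OFF, STREAM_ON, STREAM_PARALLEL } StreamMode;

/* Hypothetical summary of one ALTER SUBSCRIPTION ... SET (...) command. */
typedef struct
{
	bool		streaming_given;
	StreamMode	streaming;
	bool		delay_given;
	int32_t		delay_ms;
} AlterOpts;

/* The three checks, written in the same shape as in the patch. */
static bool
rejected_by_patch_checks(StreamMode cur_stream, int32_t cur_delay_ms,
						 AlterOpts o)
{
	/* Assumed parse-time defaults for options that were not given. */
	StreamMode	streaming = o.streaming_given ? o.streaming : STREAM_OFF;
	int32_t		delay_ms = o.delay_given ? o.delay_ms : 0;

	/* Option parsing: both values come from this command. */
	if (delay_ms > 0 && streaming == STREAM_PARALLEL)
		return true;

	/* New streaming mode checked against the stored delay. */
	if (o.streaming_given && streaming == STREAM_PARALLEL &&
		!o.delay_given && cur_delay_ms > 0)
		return true;

	/* New delay checked against the stored streaming mode. */
	if (o.delay_given && delay_ms > 0 &&
		!o.streaming_given && cur_stream == STREAM_PARALLEL)
		return true;

	return false;
}

/* The same rule expressed on the values the subscription would end up with. */
static bool
rejected_by_effective_values(StreamMode cur_stream, int32_t cur_delay_ms,
							 AlterOpts o)
{
	StreamMode	eff_stream = o.streaming_given ? o.streaming : cur_stream;
	int32_t		eff_delay = o.delay_given ? o.delay_ms : cur_delay_ms;

	assert(eff_delay >= 0);		/* negative delays are rejected earlier */
	return eff_delay > 0 && eff_stream == STREAM_PARALLEL;
}

/* Compare both formulations over a small grid; returns the mismatch count. */
static int
count_mismatches(void)
{
	const int32_t delays[] = {0, 1000};
	int			mismatches = 0;

	for (int cs = STREAM_OFF; cs <= STREAM_PARALLEL; cs++)
		for (int cd = 0; cd < 2; cd++)
		{
			/* Skip stored states that the checks never allow to exist. */
			if (cs == STREAM_PARALLEL && delays[cd] > 0)
				continue;

			for (int sg = 0; sg < 2; sg++)
				for (int s = STREAM_OFF; s <= STREAM_PARALLEL; s++)
					for (int dg = 0; dg < 2; dg++)
						for (int d = 0; d < 2; d++)
						{
							AlterOpts	o = {sg != 0, (StreamMode) s,
											 dg != 0, delays[d]};

							if (rejected_by_patch_checks((StreamMode) cs, delays[cd], o) !=
								rejected_by_effective_values((StreamMode) cs, delays[cd], o))
								mismatches++;
						}
		}
	return mismatches;
}
```

If the assumptions above hold, count_mismatches() should report no disagreement between the two formulations, which may make the intent of the separate checks easier to review.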
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index da437e0bc3..32db20fd98 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -704,7 +704,8 @@ pa_process_spooled_messages_if_required(void)
 	{
 		apply_spooled_messages(&MyParallelShared->fileset,
 							   MyParallelShared->xid,
-							   InvalidXLogRecPtr);
+							   InvalidXLogRecPtr,
+							   0);
 		pa_set_fileset_state(MyParallelShared, FS_EMPTY);
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfb2ab6248..e52143b588 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -319,6 +319,17 @@ static List *on_commit_wakeup_workers_subids = NIL;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/*
+ * To avoid a walsender timeout in time-delayed logical replication, the
+ * apply worker keeps sending feedback messages during the delay period.
+ * Because the delay is applied before the transaction starts, no WAL
+ * records are written for the suspended changes during the wait. Hence,
+ * when the apply worker sends a feedback message during the delay, it must
+ * not overwrite the flush and apply positions with the last received
+ * LSN. See send_feedback() for details.
+ */
+static XLogRecPtr last_received = InvalidXLogRecPtr;
+
 /* fields valid only when processing streamed transaction */
 static bool in_streamed_transaction = false;
 
@@ -389,7 +400,8 @@ static void stream_write_change(char action, StringInfo s);
 static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
 static void stream_close_file(void);
 
-static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
+static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply,
+						  bool has_unprocessed_change);
 
 static void DisableSubscriptionAndExit(void);
 
@@ -999,6 +1011,115 @@ slot_modify_data(TupleTableSlot *slot, TupleTableSlot *srcslot,
 	ExecStoreVirtualTuple(slot);
 }
 
+/*
+ * When min_apply_delay parameter is set on the subscriber, we wait long enough
+ * to make sure a transaction is applied at least that period behind the
+ * publisher.
+ *
+ * While physical replication applies the delay at commit time, this
+ * feature applies the delay before starting the transaction. This is
+ * mainly because keeping a transaction that has performed write
+ * operations open for a long time causes issues such as bloat and locks
+ * being held.
+ *
+ * The min_apply_delay parameter will take effect only after all tables are in
+ * READY state.
+ *
+ * xid is the transaction id where we apply the delay.
+ *
+ * finish_ts is the commit/prepare time of both regular (non-streamed) and
+ * streamed transactions. Unlike the regular (non-streamed) cases, the delay
+ * is applied in a STREAM COMMIT/STREAM PREPARE message for streamed
+ * transactions. The STREAM START message does not contain a commit/prepare
+ * time (it will be available when the in-progress transaction finishes).
+ * Hence, it's not appropriate to apply a delay at the STREAM START time.
+ */
+static void
+maybe_apply_delay(TransactionId xid, TimestampTz finish_ts)
+{
+	long statusinterval_ms;
+
+	Assert(finish_ts > 0);
+
+	/* Nothing to do if no delay set */
+	if (!MySubscription->minapplydelay)
+		return;
+
+	/*
+	 * The min_apply_delay parameter is ignored until all tablesync workers
+	 * have reached READY state. This is because if we allowed the delay
+	 * during the catchup phase, then once we reached the limit of tablesync
+	 * workers it would impose a delay for each subsequent worker. That would
+	 * cause initial table synchronization completion to take a long time.
+	 */
+	if (!AllTablesyncsReady())
+		return;
+
+	/* Calculate the time interval between status reports */
+	statusinterval_ms = wal_receiver_status_interval * 1000L;
+
+	/* Apply the delay by the latch mechanism */
+	while (true)
+	{
+		TimestampTz delayUntil;
+		long		diffms;
+
+		ResetLatch(MyLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* This might change wal_receiver_status_interval */
+		if (ConfigReloadPending)
+		{
+			ConfigReloadPending = false;
+			ProcessConfigFile(PGC_SIGHUP);
+			/* Re-calculate the time interval between status reports */
+			statusinterval_ms = wal_receiver_status_interval * 1000L;
+		}
+
+		/*
+		 * Before calculating the time duration, reload the catalog if needed.
+		 */
+		if (!in_remote_transaction && !in_streamed_transaction)
+		{
+			AcceptInvalidationMessages();
+			maybe_reread_subscription();
+		}
+
+		delayUntil = TimestampTzPlusMilliseconds(finish_ts, MySubscription->minapplydelay);
+		diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
+
+		/*
+		 * Exit without arming the latch if it's already past time to apply
+		 * this transaction.
+		 */
+		if (diffms <= 0)
+			break;
+
+		elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = %d ms, remaining wait time: %ld ms",
+			 xid, MySubscription->minapplydelay, diffms);
+
+		/*
+		 * Call send_feedback() to prevent the publisher from exiting by
+		 * timeout during the delay, when statusinterval_ms is greater than
+		 * zero.
+		 */
+		if (statusinterval_ms > 0 && diffms > statusinterval_ms)
+		{
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  statusinterval_ms,
+					  WAIT_EVENT_RECOVERY_APPLY_DELAY);
+			send_feedback(last_received, true, false, true);
+		}
+		else
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  diffms,
+					  WAIT_EVENT_RECOVERY_APPLY_DELAY);
+	}
+}
+
 /*
  * Handle BEGIN message.
  */
@@ -1013,6 +1134,9 @@ apply_handle_begin(StringInfo s)
 	logicalrep_read_begin(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
 
+	/* Should we delay the current transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.committime);
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	maybe_start_skipping_changes(begin_data.final_lsn);
@@ -1070,6 +1194,9 @@ apply_handle_begin_prepare(StringInfo s)
 	logicalrep_read_begin_prepare(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.prepare_lsn);
 
+	/* Should we delay the current prepared transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.prepare_time);
+
 	remote_final_lsn = begin_data.prepare_lsn;
 
 	maybe_start_skipping_changes(begin_data.prepare_lsn);
@@ -1317,7 +1444,8 @@ apply_handle_stream_prepare(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset,
-								   prepare_data.xid, prepare_data.prepare_lsn);
+								   prepare_data.xid, prepare_data.prepare_lsn,
+								   prepare_data.prepare_time);
 
 			/* Mark the transaction as prepared. */
 			apply_handle_prepare_internal(&prepare_data);
@@ -2011,10 +2139,13 @@ ensure_last_message(FileSet *stream_fileset, TransactionId xid, int fileno,
 
 /*
  * Common spoolfile processing.
+ *
+ * The commit/prepare time (finish_ts) is required for time-delayed logical
+ * replication.
  */
 void
 apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-					   XLogRecPtr lsn)
+					   XLogRecPtr lsn, TimestampTz finish_ts)
 {
 	StringInfoData s2;
 	int			nchanges;
@@ -2025,6 +2156,10 @@ apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
 	int			fileno;
 	off_t		offset;
 
+	/* Should we delay the current transaction? */
+	if (finish_ts)
+		maybe_apply_delay(xid, finish_ts);
+
 	if (!am_parallel_apply_worker())
 		maybe_start_skipping_changes(lsn);
 
@@ -2174,7 +2309,7 @@ apply_handle_stream_commit(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset, xid,
-								   commit_data.commit_lsn);
+								   commit_data.commit_lsn, commit_data.committime);
 
 			apply_handle_commit_internal(&commit_data);
 
@@ -3447,7 +3582,7 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
  * Apply main loop.
  */
 static void
-LogicalRepApplyLoop(XLogRecPtr last_received)
+LogicalRepApplyLoop(void)
 {
 	TimestampTz last_recv_timestamp = GetCurrentTimestamp();
 	bool		ping_sent = false;
@@ -3568,7 +3703,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						if (last_received < end_lsn)
 							last_received = end_lsn;
 
-						send_feedback(last_received, reply_requested, false);
+						send_feedback(last_received, reply_requested, false, false);
 						UpdateWorkerStats(last_received, timestamp, true);
 					}
 					/* other message types are purposefully ignored */
@@ -3581,7 +3716,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		}
 
 		/* confirm all writes so far */
-		send_feedback(last_received, false, false);
+		send_feedback(last_received, false, false, false);
 
 		if (!in_remote_transaction && !in_streamed_transaction)
 		{
@@ -3678,7 +3813,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 				}
 			}
 
-			send_feedback(last_received, requestReply, requestReply);
+			send_feedback(last_received, requestReply, requestReply, false);
 
 			/*
 			 * Force reporting to ensure long idle periods don't lead to
@@ -3708,7 +3843,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
  * to send a response to avoid timeouts.
  */
 static void
-send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
+send_feedback(XLogRecPtr recvpos, bool force, bool requestReply, bool has_unprocessed_change)
 {
 	static StringInfo reply_message = NULL;
 	static TimestampTz send_time = 0;
@@ -3738,8 +3873,14 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	/*
 	 * No outstanding transactions to flush, we can report the latest received
 	 * position. This is important for synchronous replication.
+	 *
+	 * If the subscription has unprocessed changes, do not inform the
+	 * publisher that the latest received LSN is already applied and
+	 * flushed; otherwise, the publisher would make a wrong assumption about
+	 * the logical replication progress. Instead, just send a feedback
+	 * message to avoid a replication timeout during the delay.
 	 */
-	if (!have_pending_txes)
+	if (!have_pending_txes && !has_unprocessed_change)
 		flushpos = writepos = recvpos;
 
 	if (writepos < last_writepos)
@@ -3776,8 +3917,9 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
 
-	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
+	elog(DEBUG2, "sending feedback (force %d, has_unprocessed_change %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
+		 has_unprocessed_change,
 		 LSN_FORMAT_ARGS(recvpos),
 		 LSN_FORMAT_ARGS(writepos),
 		 LSN_FORMAT_ARGS(flushpos));
@@ -4367,11 +4509,11 @@ start_table_sync(XLogRecPtr *origin_startpos, char **myslotname)
  * of system resource error and are not repeatable.
  */
 static void
-start_apply(XLogRecPtr origin_startpos)
+start_apply(void)
 {
 	PG_TRY();
 	{
-		LogicalRepApplyLoop(origin_startpos);
+		LogicalRepApplyLoop();
 	}
 	PG_CATCH();
 	{
@@ -4661,7 +4803,8 @@ ApplyWorkerMain(Datum main_arg)
 	}
 
 	/* Run the main loop. */
-	start_apply(origin_startpos);
+	last_received = origin_startpos;
+	start_apply();
 
 	proc_exit(0);
 }
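
To illustrate how the wait in maybe_apply_delay() is sliced: the remaining delay is recomputed on every pass, and when it is longer than the status interval the worker sleeps for one interval and then sends feedback. The following is a simplified standalone model, not code from the patch. DelayStep, next_delay_step() and simulate_feedback_count() are hypothetical names. All times are plain milliseconds, and latch wakeups, config reloads and catalog rereads are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: what the wait loop should do next. */
typedef struct
{
	bool		done;			/* delay has elapsed, apply may proceed */
	int64_t		wait_ms;		/* how long to sleep before re-checking */
	bool		send_feedback;	/* send a feedback message after the sleep */
} DelayStep;

/* Decide the next step from the commit/prepare time and the current time. */
static DelayStep
next_delay_step(int64_t finish_ms, int32_t min_apply_delay_ms,
				int64_t now_ms, int64_t status_interval_ms)
{
	DelayStep	step = {false, 0, false};
	int64_t		remaining_ms;

	assert(min_apply_delay_ms >= 0);

	remaining_ms = finish_ms + min_apply_delay_ms - now_ms;

	/* Already past the time at which the transaction may be applied. */
	if (remaining_ms <= 0)
	{
		step.done = true;
		return step;
	}

	if (status_interval_ms > 0 && remaining_ms > status_interval_ms)
	{
		/* Sleep one status interval, then keep the publisher informed. */
		step.wait_ms = status_interval_ms;
		step.send_feedback = true;
	}
	else
		step.wait_ms = remaining_ms;	/* final (or only) sleep */

	return step;
}

/* Run the loop against a virtual clock and count the feedback messages. */
static int
simulate_feedback_count(int64_t finish_ms, int32_t min_apply_delay_ms,
						int64_t start_ms, int64_t status_interval_ms)
{
	int64_t		now_ms = start_ms;
	int			sent = 0;

	for (;;)
	{
		DelayStep	step = next_delay_step(finish_ms, min_apply_delay_ms,
										   now_ms, status_interval_ms);

		if (step.done)
			break;

		now_ms += step.wait_ms;	/* pretend the sleep ran to its timeout */
		if (step.send_feedback)
			sent++;
	}
	return sent;
}
```

In this model, a 25 s delay with a 10 s status interval sleeps 10 s, 10 s and 5 s, with feedback sent after the first two sleeps only. A status interval of zero results in a single sleep and no feedback during the delay.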
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 527c7651ab..1e87f0124e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4494,6 +4494,7 @@ getSubscriptions(Archive *fout)
 	int			i_subsynccommit;
 	int			i_subpublications;
 	int			i_subbinary;
+	int			i_subminapplydelay;
 	int			i,
 				ntups;
 
@@ -4546,9 +4547,13 @@ getSubscriptions(Archive *fout)
 						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	if (fout->remoteVersion >= 160000)
-		appendPQExpBufferStr(query, " s.suborigin\n");
+		appendPQExpBufferStr(query,
+							 " s.suborigin,\n"
+							 " s.subminapplydelay\n");
 	else
-		appendPQExpBuffer(query, " '%s' AS suborigin\n", LOGICALREP_ORIGIN_ANY);
+		appendPQExpBuffer(query, " '%s' AS suborigin,\n"
+						  " 0 AS subminapplydelay\n",
+						  LOGICALREP_ORIGIN_ANY);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4576,6 +4581,7 @@ getSubscriptions(Archive *fout)
 	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 	i_subdisableonerr = PQfnumber(res, "subdisableonerr");
 	i_suborigin = PQfnumber(res, "suborigin");
+	i_subminapplydelay = PQfnumber(res, "subminapplydelay");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4606,6 +4612,8 @@ getSubscriptions(Archive *fout)
 		subinfo[i].subdisableonerr =
 			pg_strdup(PQgetvalue(res, i, i_subdisableonerr));
 		subinfo[i].suborigin = pg_strdup(PQgetvalue(res, i, i_suborigin));
+		subinfo[i].subminapplydelay =
+			atoi(PQgetvalue(res, i, i_subminapplydelay));
 
 		/* Decide whether we want to dump it */
 		selectDumpableObject(&(subinfo[i].dobj), fout);
@@ -4687,6 +4695,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
+	if (subinfo->subminapplydelay > 0)
+		appendPQExpBuffer(query, ", min_apply_delay = '%d ms'", subinfo->subminapplydelay);
+
 	appendPQExpBufferStr(query, ");\n");
 
 	if (subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION)
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index e7cbd8d7ed..b8831c3ed3 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -661,6 +661,7 @@ typedef struct _SubscriptionInfo
 	char	   *subdisableonerr;
 	char	   *suborigin;
 	char	   *subsynccommit;
+	int			subminapplydelay;
 	char	   *subpublications;
 } SubscriptionInfo;
 
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index c8a0bb7b3a..81d4607a1c 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6472,7 +6472,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false, false, false, false, false};
+	false, false, false, false, false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6527,10 +6527,13 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Two-phase commit"),
 							  gettext_noop("Disable on error"));
 
+		/* Origin and min_apply_delay are only supported in v16 and higher */
 		if (pset.sversion >= 160000)
 			appendPQExpBuffer(&buf,
-							  ", suborigin AS \"%s\"\n",
-							  gettext_noop("Origin"));
+							  ", suborigin AS \"%s\"\n"
+							  ", subminapplydelay AS \"%s\"\n",
+							  gettext_noop("Origin"),
+							  gettext_noop("Min apply delay"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5e1882eaea..e8b9a43a47 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1925,7 +1925,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("(", "PUBLICATION");
 	/* ALTER SUBSCRIPTION <name> SET ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SET", "("))
-		COMPLETE_WITH("binary", "disable_on_error", "origin", "slot_name",
+		COMPLETE_WITH("binary", "disable_on_error", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit");
 	/* ALTER SUBSCRIPTION <name> SKIP ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SKIP", "("))
@@ -3268,7 +3268,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
-					  "disable_on_error", "enabled", "origin", "slot_name",
+					  "disable_on_error", "enabled", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index b0f2a1705d..d1cfefc6d6 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -74,6 +74,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	Oid			subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */
 
+	int32		subminapplydelay;	/* Replication apply delay (ms) */
+
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
@@ -122,6 +124,7 @@ typedef struct Subscription
 								 * skipped */
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
+	int32		minapplydelay;	/* Replication apply delay (ms) */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index dc87a4edd1..3dc09d1a4c 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -255,7 +255,7 @@ extern void stream_stop_internal(TransactionId xid);
 
 /* Common streaming function to apply all the spooled messages */
 extern void apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-								   XLogRecPtr lsn);
+								   XLogRecPtr lsn, TimestampTz finish_ts);
 
 extern void apply_dispatch(StringInfo s);
 
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 3f99b14394..cf8e727ee9 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -114,18 +114,18 @@ CREATE SUBSCRIPTION regress_testsub4 CONNECTION 'dbname=regress_doesnotexist' PU
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub4 SET (origin = any);
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub3;
@@ -143,10 +143,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -163,10 +163,10 @@ ERROR:  unrecognized subscription parameter: "create_slot"
 -- ok
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/12345');
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/12345
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/12345
 (1 row)
 
 -- ok - with lsn = NONE
@@ -175,10 +175,10 @@ ALTER SUBSCRIPTION regress_testsub SKIP (lsn = NONE);
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/0');
 ERROR:  invalid WAL location (LSN): 0/0
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/0
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 BEGIN;
@@ -210,10 +210,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                                               List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
----------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | local              | dbname=regress_doesnotexist2 | 0/0
+                                                                                                        List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | local              | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 -- rename back to keep the rest simple
@@ -247,19 +247,19 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -271,27 +271,27 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication already exists
@@ -306,10 +306,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                                                 List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication used more than once
@@ -324,10 +324,10 @@ ERROR:  publication "testpub3" is not in subscription "regress_testsub"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -363,10 +363,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 --fail - alter of two_phase option not supported.
@@ -375,10 +375,10 @@ ERROR:  unrecognized subscription parameter: "two_phase"
 -- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -388,10 +388,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -404,20 +404,57 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+ERROR:  invalid value for parameter "min_apply_delay": "foo"
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+ERROR:  -1 ms is outside the valid range for parameter "min_apply_delay" (0 .. 2147483647)
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+ERROR:  min_apply_delay > 0 and streaming = parallel are mutually exclusive options
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+WARNING:  subscription was created, but is not connected
+HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |             123 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |        86400000 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+ERROR:  cannot set parallel streaming mode for subscription with min_apply_delay
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ERROR:  cannot set min_apply_delay for subscription in parallel streaming mode
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 7281f5fee2..7317b140f5 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -286,6 +286,30 @@ ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+\dRs+
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 91aa068c95..75fd77b891 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -515,6 +515,34 @@ $node_publisher->poll_query_until('postgres',
   or die
   "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
 
+# Test time-delayed logical replication
+#
+# If the subscription sets the min_apply_delay parameter, the logical replication
+# worker delays applying each transaction by min_apply_delay milliseconds. We
+# verify this by looking at the time difference between a) when tuples are
+# inserted on the publisher, and b) when those changes are replicated on the
+# subscriber. Even on slow machines, this strategy will give predictable behavior.
+
+# Set min_apply_delay parameter to 3 seconds
+my $delay = 3;
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+
+# Before doing the insertion, get the current timestamp that will be
+# used as a comparison base.
+my $publisher_insert_time = time();
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_ins VALUES (generate_series(1101, 1120))");
+
+# The publisher waits for the replication to complete
+$node_publisher->wait_for_catchup('tap_sub_renamed');
+
+# This test is successful if and only if the LSN has been applied with at least
+# the configured apply delay.
+ok( time() - $publisher_insert_time >= $delay,
+	"subscriber applies WAL only after replication delay for non-streaming transaction"
+);
+
 # check all the cleanup
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
 
-- 
2.27.0

v5-0002-Extend-START_REPLICATION-command-to-accept-walsen.patch (application/octet-stream)
From 0385ecc25e2b8486843fa190760f6beb133fd2f0 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 8 Feb 2023 09:09:31 +0000
Subject: [PATCH v5 2/2] Extend START_REPLICATION command to accept walsender
 options

This commit extends START_REPLICATION to accept SHUTDOWN_MODE term. Currently,
it works well only for logical replication.

When 'wait_flush', which is the default, is specified, the walsender will wait
for all the sent WALs to be flushed on the subscriber side, before exiting the
process. 'immediate' will exit without confirming the remote flush. This may
break the consistency between publisher and subscriber, but it may be useful
for a system that has a high-latency network to reduce the amount of time for
shutdown. This may be useful to shut down the publisher even when the
worker is stuck.

Author: Hayato Kuroda
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 doc/src/sgml/protocol.sgml                    | 18 ++++-
 .../libpqwalreceiver/libpqwalreceiver.c       |  7 ++
 src/backend/replication/logical/worker.c      | 13 ++-
 src/backend/replication/repl_gram.y           | 12 ++-
 src/backend/replication/repl_scanner.l        |  1 +
 src/backend/replication/walreceiver.c         |  1 +
 src/backend/replication/walsender.c           | 80 ++++++++++++++++++-
 src/include/nodes/replnodes.h                 |  1 +
 src/include/replication/walreceiver.h         |  1 +
 src/test/subscription/t/001_rep_changes.pl    |  7 +-
 src/tools/pgindent/typedefs.list              |  2 +
 11 files changed, 135 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 93fc7167d4..2622b084ed 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2500,7 +2500,7 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
     </varlistentry>
 
     <varlistentry id="protocol-replication-start-replication-slot-logical">
-     <term><literal>START_REPLICATION</literal> <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> <literal>LOGICAL</literal> <replaceable class="parameter">XXX/XXX</replaceable> [ ( <replaceable>option_name</replaceable> [ <replaceable>option_value</replaceable> ] [, ...] ) ]</term>
+     <term><literal>START_REPLICATION</literal> <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> <literal>LOGICAL</literal> <replaceable class="parameter">XXX/XXX</replaceable> [ <literal>SHUTDOWN_MODE</literal> <replaceable class="parameter">shutdown_mode</replaceable> ] [ ( <replaceable>option_name</replaceable> [ <replaceable>option_value</replaceable> ] [, ...] ) ]</term>
      <listitem>
       <para>
        Instructs server to start streaming WAL for logical replication,
@@ -2555,6 +2555,22 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
         </listitem>
        </varlistentry>
 
+       <varlistentry>
+        <term><literal>SHUTDOWN_MODE { 'wait_flush' | 'immediate' }</literal></term>
+        <listitem>
+         <para>
+          Decides the condition for exiting the walsender process.
+          <literal>'wait_flush'</literal>, which is the default, the walsender
+          will wait for all the sent WALs to be flushed on the subscriber side,
+          before exiting the process. <literal>'immediate'</literal> will exit
+          without confirming the remote flush. This may break the consistency
+          between publisher and subscriber, but it may be useful for a system
+          that has a high-latency network to reduce the amount of time for
+          shutdown.
+         </para>
+        </listitem>
+       </varlistentry>
+
        <varlistentry>
         <term><replaceable class="parameter">option_name</replaceable></term>
         <listitem>
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 560ec974fa..18f6e09cfd 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -403,6 +403,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		List	   *pubnames;
 		char	   *pubnames_literal;
 
+		/* Add SHUTDOWN_MODE option if needed */
+		if (options->shutdown_mode &&
+			PQserverVersion(conn->streamConn) >= 160000)
+			appendStringInfo(&cmd, " SHUTDOWN_MODE '%s'",
+							 options->shutdown_mode);
+
 		appendStringInfoString(&cmd, " (");
 
 		appendStringInfo(&cmd, "proto_version '%u'",
@@ -449,6 +455,7 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, " TIMELINE %u",
 						 options->proto.physical.startpointTLI);
 
+
 	/* Start streaming. */
 	res = libpqrcv_PQexec(conn->streamConn, cmd.data);
 	pfree(cmd.data);
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e52143b588..5c5f28bf0d 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4040,7 +4040,9 @@ maybe_reread_subscription(void)
 		newsub->stream != MySubscription->stream ||
 		strcmp(newsub->origin, MySubscription->origin) != 0 ||
 		newsub->owner != MySubscription->owner ||
-		!equal(newsub->publications, MySubscription->publications))
+		!equal(newsub->publications, MySubscription->publications) ||
+		/* minapplydelay affects SHUTDOWN_MODE option */
+		(newsub->minapplydelay == 0) != (MySubscription->minapplydelay == 0))
 	{
 		if (am_parallel_apply_worker())
 			ereport(LOG,
@@ -4724,6 +4726,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	options.shutdown_mode = NULL;
 
 	server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
 	options.proto.logical.proto_version =
@@ -4762,6 +4765,14 @@ ApplyWorkerMain(Datum main_arg)
 
 	if (!am_tablesync_worker())
 	{
+		/*
+		 * time-delayed logical replication does not support tablesync
+		 * workers, so only the leader apply worker can request walsenders to
+		 * exit before confirming remote flush.
+		 */
+		if (server_version >= 160000 && MySubscription->minapplydelay > 0)
+			options.shutdown_mode = pstrdup("immediate");
+
 		/*
 		 * Even when the two_phase mode is requested by the user, it remains
 		 * as the tri-state PENDING until all tablesyncs have reached READY
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..54450a041a 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,6 +76,7 @@ Node *replication_parse_result;
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
+%token K_SHUTDOWN_MODE
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -91,6 +92,7 @@ Node *replication_parse_result;
 %type <boolval>	opt_temporary
 %type <list>	create_slot_options create_slot_legacy_opt_list
 %type <defelt>	create_slot_legacy_opt
+%type <str>	opt_shutdown_mode
 
 %%
 
@@ -270,20 +272,22 @@ start_replication:
 					cmd->slotname = $2;
 					cmd->startpoint = $4;
 					cmd->timeline = $5;
+					cmd->shutdownmode = NULL;
 					$$ = (Node *) cmd;
 				}
 			;
 
 /* START_REPLICATION SLOT slot LOGICAL %X/%X options */
 start_logical_replication:
-			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options
+			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR opt_shutdown_mode plugin_options
 				{
 					StartReplicationCmd *cmd;
 					cmd = makeNode(StartReplicationCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $3;
 					cmd->startpoint = $5;
-					cmd->options = $6;
+					cmd->shutdownmode = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -336,6 +340,10 @@ opt_timeline:
 				| /* EMPTY */			{ $$ = 0; }
 			;
 
+opt_shutdown_mode:
+			K_SHUTDOWN_MODE SCONST			{ $$ = $2; }
+			| /* EMPTY */					{ $$ = NULL; }
+		;
 
 plugin_options:
 			'(' plugin_opt_list ')'			{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index cb467ca46f..fcc6f6feda 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
 WAIT				{ return K_WAIT; }
+SHUTDOWN_MODE		{ return K_SHUTDOWN_MODE; }
 
 {space}+		{ /* do nothing */ }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f6446da2d6..cfce9d93ef 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -409,6 +409,7 @@ WalReceiverMain(void)
 		options.logical = false;
 		options.startpoint = startpoint;
 		options.slotname = slotname[0] != '\0' ? slotname : NULL;
+		options.shutdown_mode = NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		if (walrcv_startstreaming(wrconn, &options))
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4ed3747e3f..65d08bdc95 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -219,6 +219,25 @@ typedef struct
 
 static LagTracker *lag_tracker;
 
+/* Indicator for specifying the shutdown mode */
+typedef enum
+{
+	WALSND_SHUTDOWN_MODE_WAIT_FLUSH = 0,
+	WALSND_SHUTDOWN_MODE_IMMIDEATE
+} WalSndShutdownMode;
+
+/*
+ * Options for controlling the behavior of the walsender. Options can be
+ * specified in the START_STREAMING replication command. Currently only one
+ * option is allowed.
+ */
+typedef struct
+{
+	WalSndShutdownMode shutdown_mode;
+} WalSndOptions;
+
+static WalSndOptions *my_options = NULL;
+
 /* Signal handlers */
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
@@ -260,6 +279,8 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
 							  TimeLineID *tli_p);
 
+static void CheckWalSndOptions(const StartReplicationCmd *cmd);
+static void ParseShutdownMode(char *shutdownmode);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -1272,6 +1293,12 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 		got_STOPPING = true;
 	}
 
+	/* Initialize an option holder */
+	my_options = (WalSndOptions *) palloc0(sizeof(WalSndOptions));
+
+	/* Check given options and set value to the holder */
+	CheckWalSndOptions(cmd);
+
 	/*
 	 * Create our decoding context, making it start at the previously ack'ed
 	 * position.
@@ -1450,6 +1477,16 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		/*
+		 * In this function, there is a possibility that the walsender is
+		 * stuck. It is caused when the opposite worker is stuck and then the
+		 * send-buffer of the walsender becomes full. Therefore, we must add
+		 * an additional path for shutdown for immediate shutdown mode.
+		 */
+		if (my_options->shutdown_mode == WALSND_SHUTDOWN_MODE_IMMIDEATE &&
+			got_STOPPING)
+			WalSndDone(XLogSendLogical);
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
@@ -3114,19 +3151,26 @@ WalSndDone(WalSndSendDataCallback send_data)
 	 * To figure out whether all WAL has successfully been replicated, check
 	 * flush location if valid, write otherwise. Tools like pg_receivewal will
 	 * usually (unless in synchronous mode) return an invalid flush location.
+	 *
+	 * If we are in the immediate shutdown mode, flush location and output
+	 * buffer is not checked. This may break the consistency between nodes,
+	 * but it may be useful for the system that has high-latency network to
+	 * reduce the amount of time for shutdown.
 	 */
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	if (WalSndCaughtUp &&
+		((my_options &&
+		  my_options->shutdown_mode == WALSND_SHUTDOWN_MODE_IMMIDEATE) ||
+		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
 	{
 		QueryCompletion qc;
 
 		/* Inform the standby that XLOG streaming is done */
 		SetQueryCompletion(&qc, CMDTAG_COPY, 0);
 		EndCommand(&qc, DestRemote, false);
-		pq_flush();
+		pq_flush_if_writable();
 
 		proc_exit(0);
 	}
@@ -3849,3 +3893,33 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+/*
+ * Check options for walsender itself and set a value to an option holder.
+ *
+ * Currently only one option is accepted.
+ */
+static void
+CheckWalSndOptions(const StartReplicationCmd *cmd)
+{
+	if (cmd->shutdownmode)
+		ParseShutdownMode(cmd->shutdownmode);
+}
+
+/*
+ * Parse given shutdown mode.
+ *
+ * Currently two values are accepted - "wait_flush" and "immediate"
+ */
+static void
+ParseShutdownMode(char *shutdownmode)
+{
+	if (pg_strcasecmp(shutdownmode, "wait_flush") == 0)
+		my_options->shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+	else if (pg_strcasecmp(shutdownmode, "immediate") == 0)
+		my_options->shutdown_mode = WALSND_SHUTDOWN_MODE_IMMIDEATE;
+	else
+		ereport(ERROR,
+				errcode(ERRCODE_SYNTAX_ERROR),
+				errmsg("SHUTDOWN_MODE requires \"wait_flush\" or \"immediate\""));
+}
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..c96e85e859 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -83,6 +83,7 @@ typedef struct StartReplicationCmd
 	char	   *slotname;
 	TimeLineID	timeline;
 	XLogRecPtr	startpoint;
+	char	   *shutdownmode;
 	List	   *options;
 } StartReplicationCmd;
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index decffe352d..ef6297da52 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -170,6 +170,7 @@ typedef struct
 								 * false if physical stream.  */
 	char	   *slotname;		/* Name of the replication slot or NULL. */
 	XLogRecPtr	startpoint;		/* LSN of starting point. */
+	char	   *shutdown_mode;	/* Name of specified shutdown name */
 
 	union
 	{
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 75fd77b891..0d856e9bd4 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -523,10 +523,15 @@ $node_publisher->poll_query_until('postgres',
 # inserted on the publisher, and b) when those changes are replicated on the
 # subscriber. Even on slow machines, this strategy will give predictable behavior.
 
-# Set min_apply_delay parameter to 3 seconds
+# Check restart on changing min_apply_delay to 3 seconds
 my $delay = 3;
 $node_subscriber->safe_psql('postgres',
 	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+$node_publisher->poll_query_until('postgres',
+	"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+  )
+  or die
+  "Timed out while waiting for apply to restart after changing min_apply_delay to non-zero value";
 
 # Before doing the insertion, get the current timestamp that will be
 # used as a comparison base.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d3224dfc36..837cbf8d9d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2969,7 +2969,9 @@ WalReceiverConn
 WalReceiverFunctionsType
 WalSnd
 WalSndCtlData
+WalSndOptions
 WalSndSendDataCallback
+WalSndShutdownMode
 WalSndState
 WalTimeSample
 WalUsage
-- 
2.27.0

#48Takamichi Osumi (Fujitsu)
osumi.takamichi@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#47)
RE: Exit walsender before confirming remote flush in logical replication

On Wednesday, February 8, 2023 6:47 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:

PSA rebased patch that supports updated time-delayed patch[1].

Hi,

Thanks for creating the patch! Here are some minor review comments on v5-0002.

(1)

+          Decides the condition for exiting the walsender process.
+          <literal>'wait_flush'</literal>, which is the default, the walsender
+          will wait for all the sent WALs to be flushed on the subscriber side,
+          before exiting the process. <literal>'immediate'</literal> will exit
+          without confirming the remote flush. This may break the consistency
+          between publisher and subscriber, but it may be useful for a system
+          that has a high-latency network to reduce the amount of time for
+          shutdown.

(1-1)

The first part, "exiting the walsender process", can be improved.
You could probably say "the exiting walsender process" or
"Decides the behavior of the walsender process at shutdown" instead.

(1-2)

Also, the next sentence could be reworded to something like
"If the shutdown mode is wait_flush, which is the default, the
walsender waits for all the sent WALs to be flushed on the subscriber side.
If it is immediate, the walsender exits without confirming the remote flush".

(1-3)

We don't need to wrap wait_flush and immediate in single quotes
within the literal tag.

(2)

+ /* minapplydelay affects SHUTDOWN_MODE option */

I think we can move this comment to just above the 'if' condition
and combine it with the existing 'if' conditions comments.

(3) 001_rep_changes.pl

(3-1) Question

In general, do we add this kind of check when we extend the protocol (START_REPLICATION command)
or add a new condition for apply worker exit?
If the intent is to detect the restart of the walsender process in the TAP tests,
could you tell me how the new test code serves the purpose of this patch?

(3-2)

+ "Timed out while waiting for apply to restart after changing min_apply_delay to non-zero value";

Probably, we can partly change this sentence like below, because we check walsender's pid.
FROM: "... while waiting for apply to restart..."
TO: "... while waiting for the walsender to restart..."

Best Regards,
Takamichi Osumi

#49Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#47)
Re: Exit walsender before confirming remote flush in logical replication

Hi,

On Wed, Feb 8, 2023 at 6:47 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Horiguchi-san,

Thank you for checking the patch! PSA new version.

PSA rebased patch that supports updated time-delayed patch[1].

Thank you for the patch! Here are some comments on v5 patch:

+/*
+ * Options for controlling the behavior of the walsender. Options can be
+ * specified in the START_STREAMING replication command. Currently only one
+ * option is allowed.
+ */
+typedef struct
+{
+        WalSndShutdownMode shutdown_mode;
+} WalSndOptions;
+
+static WalSndOptions *my_options = NULL;

I'm not sure we need to have it as a struct at this stage since we
support only one option. I wonder if we can have one value, say
shutdown_mode, and we can make it a struct when we really need it.
Even if we use the WalSndOptions struct, I don't think we need to
dynamically allocate it. A walsender can, in principle, start logical
replication multiple times, and the my_options allocated on each start
is never freed.
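
To make the suggestion above concrete, here is a minimal standalone sketch
(plain C, not PostgreSQL code) of keeping the mode in a single file-scope
variable rather than in a palloc'd struct. The names ShutdownMode,
shutdown_mode and set_shutdown_mode are hypothetical, and strcasecmp stands
in for pg_strcasecmp. Resetting to the default on every start is an
assumption about the desired behavior, not something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>			/* strcasecmp (POSIX) */

/* Mirrors the two modes in the patch; the names are illustrative only. */
typedef enum
{
	SHUTDOWN_MODE_WAIT_FLUSH = 0,	/* default: wait for the remote flush */
	SHUTDOWN_MODE_IMMEDIATE
} ShutdownMode;

/* One file-scope value instead of a heap-allocated options struct. */
static ShutdownMode shutdown_mode = SHUTDOWN_MODE_WAIT_FLUSH;

/*
 * Called each time replication starts. The mode is reset to the default
 * first, so a setting from an earlier start in the same session does not
 * carry over. Returns 0 on success, -1 for an unrecognized mode string.
 */
static int
set_shutdown_mode(const char *mode)
{
	shutdown_mode = SHUTDOWN_MODE_WAIT_FLUSH;

	if (mode == NULL)
		return 0;				/* option omitted: keep the default */
	if (strcasecmp(mode, "wait_flush") == 0)
		return 0;
	if (strcasecmp(mode, "immediate") == 0)
	{
		shutdown_mode = SHUTDOWN_MODE_IMMEDIATE;
		return 0;
	}
	return -1;					/* caller reports the error */
}
```

With a static variable there is nothing to free when the same walsender
starts logical replication again.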

---
+/*
+ * Parse given shutdown mode.
+ *
+ * Currently two values are accepted - "wait_flush" and "immediate"
+ */
+static void
+ParseShutdownMode(char *shutdownmode)
+{
+        if (pg_strcasecmp(shutdownmode, "wait_flush") == 0)
+                my_options->shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+        else if (pg_strcasecmp(shutdownmode, "immediate") == 0)
+                my_options->shutdown_mode = WALSND_SHUTDOWN_MODE_IMMIDEATE;
+        else
+                ereport(ERROR,
+                                errcode(ERRCODE_SYNTAX_ERROR),
+                                errmsg("SHUTDOWN_MODE requires
\"wait_flush\" or \"immediate\""));
+}

I think we should make the error message consistent with other enum
parameters. How about the message like:

ERROR: invalid value shutdown mode: "%s"
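
As an illustration of the message format proposed above, the following
standalone sketch builds the text with snprintf so that the rejected value
is echoed back. It is not PostgreSQL code: the patch itself would pass a
format string like this to errmsg(), and format_invalid_shutdown_mode is a
hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the proposed error text for a rejected value into buf.
 * Returns the number of characters the full message needs
 * (snprintf semantics), so callers can detect truncation.
 */
static int
format_invalid_shutdown_mode(char *buf, size_t buflen, const char *given)
{
	assert(buf != NULL && given != NULL);
	return snprintf(buf, buflen, "invalid value shutdown mode: \"%s\"", given);
}
```

Unlike the fixed text in v5, this form tells the user which value was
rejected.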

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#50Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Takamichi Osumi (Fujitsu) (#48)
2 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

Dear Osumi-san,

Thank you for reviewing! PSA new version.

(1)

+          Decides the condition for exiting the walsender process.
+          <literal>'wait_flush'</literal>, which is the default, the walsender
+          will wait for all the sent WALs to be flushed on the subscriber side,
+          before exiting the process. <literal>'immediate'</literal> will exit
+          without confirming the remote flush. This may break the consistency
+          between publisher and subscriber, but it may be useful for a system
+          that has a high-latency network to reduce the amount of time for
+          shutdown.

(1-1)

The first part, "exiting the walsender process", can be improved.
You could probably say "the exiting walsender process" or
"Decides the behavior of the walsender process at shutdown" instead.

Fixed. Second idea was chosen.

(1-2)

Also, the next sentence could be reworded to something like
"If the shutdown mode is wait_flush, which is the default, the
walsender waits for all the sent WALs to be flushed on the subscriber side.
If it is immediate, the walsender exits without confirming the remote flush".

Fixed.

(1-3)

We don't need to wrap wait_flush and immediate in single quotes
within the literal tag.

This style was taken from the SNAPSHOT option description, so I decided to keep it.

(2)

+ /* minapplydelay affects SHUTDOWN_MODE option */

I think we can move this comment to just above the 'if' condition
and combine it with the existing 'if' conditions comments.

Moved and added some comments.

(3) 001_rep_changes.pl

(3-1) Question

In general, do we add this kind of check when we extend the protocol (START_REPLICATION command)
or add a new condition for apply worker exit?
If the intent is to detect the restart of the walsender process in the TAP tests,
could you tell me how the new test code serves the purpose of this patch?

The replication command is not meant for ordinary users, so I think we don't have to test the command itself.

The check that waits for the apply worker to restart was added to improve robustness.
I think the test could fail if the apply worker receives a transaction
before it notices the new subscription option. The failure is now avoided by
confirming that the worker has reloaded pg_subscription and restarted.

(3-2)

+ "Timed out while waiting for apply to restart after changing min_apply_delay
to non-zero value";

Probably, we can partly change this sentence like below, because we check
walsender's pid.
FROM: "... while waiting for apply to restart..."
TO: "... while waiting for the walsender to restart..."

Right, fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v6-0001-Time-delayed-logical-replication-subscriber.patch (application/octet-stream)
From 108ae2d9a497cc4bb5c1038c9a1fac939721f047 Mon Sep 17 00:00:00 2001
From: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Date: Thu, 9 Feb 2023 09:05:23 +0000
Subject: [PATCH v6 1/2] Time-delayed logical replication subscriber

Similar to physical replication, a time-delayed copy of the data for
logical replication is useful for some scenarios (particularly to fix
errors that might cause data loss).

This patch implements a new subscription parameter called 'min_apply_delay'.

If the subscription sets min_apply_delay parameter, the logical
replication worker will delay the transaction apply for min_apply_delay
milliseconds.

The delay is calculated between the WAL time stamp and the current time
on the subscriber.

The delay occurs before we start to apply the transaction on the
subscriber. The main reason is to avoid keeping a transaction open for
a long time. Regular and prepared transactions are covered. Streamed
transactions are also covered.

The combination of parallel streaming mode and min_apply_delay is not
allowed. This is because in parallel streaming mode, we start applying
the transaction stream as soon as the first change arrives without
knowing the transaction's prepare/commit time. This means we cannot
calculate the underlying network/decoding lag between publisher and
subscriber, and so always waiting for the full 'min_apply_delay' period
might include unnecessary delay.

The other possibility was to apply the delay at the end of the parallel
apply transaction but that would cause issues related to resource
bloat and locks being held for a long time.

Note that this feature doesn't interact with skip transaction feature.
The skip transaction feature applies to one transaction with a specific LSN.
So, even if the skipped transaction and non-skipped transaction come
consecutively in a very short time, regardless of the order of which comes
first, the time-delayed feature gets balanced by delayed application
for other transactions before and after the skipped transaction.

Author: Euler Taveira, Takamichi Osumi, Kuroda Hayato
Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Shveta Malik,
             Kyotaro Horiguchi, Shi Yu, Wang Wei, Dilip Kumar, Melih Mutlu
Discussion: https://postgr.es/m/CAB-JLwYOYwL=XTyAXKiH5CtM_Vm8KjKh7aaitCKvmCh4rzr5pQ@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                    |   9 +
 doc/src/sgml/config.sgml                      |  12 ++
 doc/src/sgml/glossary.sgml                    |  15 ++
 doc/src/sgml/logical-replication.sgml         |   6 +
 doc/src/sgml/ref/alter_subscription.sgml      |   5 +-
 doc/src/sgml/ref/create_subscription.sgml     |  49 ++++-
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/catalog/system_views.sql          |   7 +-
 src/backend/commands/subscriptioncmds.c       | 122 +++++++++++-
 .../replication/logical/applyparallelworker.c |   3 +-
 src/backend/replication/logical/worker.c      | 166 ++++++++++++++--
 src/bin/pg_dump/pg_dump.c                     |  15 +-
 src/bin/pg_dump/pg_dump.h                     |   1 +
 src/bin/psql/describe.c                       |   9 +-
 src/bin/psql/tab-complete.c                   |   4 +-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/replication/worker_internal.h     |   2 +-
 src/test/regress/expected/subscription.out    | 181 +++++++++++-------
 src/test/regress/sql/subscription.sql         |  24 +++
 src/test/subscription/t/001_rep_changes.pl    |  28 +++
 20 files changed, 558 insertions(+), 104 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048054..5dc5ca1133 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7873,6 +7873,15 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subminapplydelay</structfield> <type>int4</type>
+      </para>
+      <para>
+       The minimum delay, in milliseconds, for applying changes
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subname</structfield> <type>name</type>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8c56b134a8..21b45c68e2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4787,6 +4787,18 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
        the <filename>postgresql.conf</filename> file or on the server
        command line.
       </para>
+      <para>
+       For time-delayed logical replication, the apply worker sends a feedback
+       message to the publisher every
+       <varname>wal_receiver_status_interval</varname> milliseconds. Make sure
+       to set <varname>wal_receiver_status_interval</varname> less than the
+       <varname>wal_sender_timeout</varname> on the publisher, otherwise, the
+       <literal>walsender</literal> will repeatedly terminate due to timeout
+       errors. Note that if <varname>wal_receiver_status_interval</varname> is
+       set to zero, the apply worker sends no feedback messages during the
+       <literal>min_apply_delay</literal> period. Refer to
+       <xref linkend="sql-createsubscription"/> for more information.
+      </para>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 7c01a541fe..9ede9d05f6 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1729,6 +1729,21 @@
    </glossdef>
   </glossentry>
 
+  <glossentry id="glossary-time-delayed-replication">
+   <glossterm>Time-delayed replication</glossterm>
+   <glossdef>
+    <para>
+     Replication setup that delays the application of changes by a specified
+     minimum time-delay period.
+    </para>
+    <para>
+     For more information, see
+     <xref linkend="guc-recovery-min-apply-delay"/> for physical replication
+     and <xref linkend="sql-createsubscription"/> for logical replication.
+    </para>
+   </glossdef>
+  </glossentry>
+
   <glossentry id="glossary-toast">
    <glossterm>TOAST</glossterm>
    <glossdef>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 1bd5660c87..6bd5f61e2b 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -247,6 +247,12 @@
    target table.
   </para>
 
+  <para>
+   A subscription can delay the application of changes by specifying the
+   <literal>min_apply_delay</literal> subscription parameter. See
+   <xref linkend="sql-createsubscription"/> for details.
+  </para>
+
   <sect2 id="logical-replication-subscription-slot">
    <title>Replication Slot Management</title>
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 964fcbb8ff..8b7eb28e54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -213,8 +213,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
       <literal>binary</literal>, <literal>streaming</literal>,
-      <literal>disable_on_error</literal>, and
-      <literal>origin</literal>.
+      <literal>disable_on_error</literal>,
+      <literal>origin</literal>, and
+      <literal>min_apply_delay</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 51c45f17c7..1b4b8390af 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -349,7 +349,49 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
-      </variablelist></para>
+
+       <varlistentry>
+        <term><literal>min_apply_delay</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          By default, the subscriber applies changes as soon as possible. This
+          parameter allows the user to delay the application of changes by a
+          given time period. If the value is specified without units, it is
+          taken as milliseconds. The default is zero (no delay). See
+          <xref linkend="config-setting-names-values"/> for details on the
+          valid time units.
+         </para>
+         <para>
+          Any delay becomes effective only after all initial table
+          synchronization has finished and occurs before each transaction starts
+          to get applied on the subscriber. The delay is calculated as the
+          difference between the WAL timestamp as written on the publisher and
+          the current time on the subscriber. Any overhead of time spent in
+          logical decoding and in transferring the transaction may reduce the
+          actual wait time. It is also possible that the overhead already
+          exceeds the requested <literal>min_apply_delay</literal> value, in
+          which case no delay is applied. If the system clocks on publisher and
+          subscriber are not synchronized, changes may be applied earlier
+          than expected, but this is not a major issue because this
+          parameter is typically much larger than the time deviations between
+          servers. Note that if this parameter is set to a long delay, the
+          replication will stop if the replication slot falls behind the current
+          LSN by more than
+          <link linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</literal></link>.
+         </para>
+         <warning>
+           <para>
+            Delaying the replication means there is a much longer time between
+            making a change on the publisher and that change being committed
+            on the subscriber. This can impact the performance of synchronous
+            replication. See the <xref linkend="guc-synchronous-commit"/>
+            parameter.
+           </para>
+         </warning>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
 
     </listitem>
    </varlistentry>
@@ -420,6 +462,11 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
    published with different column lists are not supported.
   </para>
 
+  <para>
+   A non-zero <literal>min_apply_delay</literal> parameter is not allowed when
+   streaming in parallel mode.
+  </para>
+
   <para>
    We allow non-existent publications to be specified so that users can add
    those later. This means
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index a56ae311c3..e19e5cbca2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->skiplsn = subform->subskiplsn;
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
+	sub->minapplydelay = subform->subminapplydelay;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..317c2010cb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1299,9 +1299,10 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled,
-              subbinary, substream, subtwophasestate, subdisableonerr,
-              subslotname, subsynccommit, subpublications, suborigin)
+GRANT SELECT (oid, subdbid, subskiplsn, subminapplydelay, subname, subowner,
+              subenabled, subbinary, substream, subtwophasestate,
+              subdisableonerr, subslotname, subsynccommit, subpublications,
+              suborigin)
     ON pg_subscription TO public;
 
 CREATE VIEW pg_stat_subscription_stats AS
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 464db6d247..82e16fd0f9 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -66,6 +66,7 @@
 #define SUBOPT_DISABLE_ON_ERR		0x00000400
 #define SUBOPT_LSN					0x00000800
 #define SUBOPT_ORIGIN				0x00001000
+#define SUBOPT_MIN_APPLY_DELAY		0x00002000
 
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
@@ -90,6 +91,7 @@ typedef struct SubOpts
 	bool		disableonerr;
 	char	   *origin;
 	XLogRecPtr	lsn;
+	int32		min_apply_delay;
 } SubOpts;
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
@@ -100,7 +102,7 @@ static void check_publications_origin(WalReceiverConn *wrconn,
 static void check_duplicates_in_publist(List *publist, Datum *datums);
 static List *merge_publications(List *oldpublist, List *newpublist, bool addpub, const char *subname);
 static void ReportSlotConnectionError(List *rstates, Oid subid, char *slotname, char *err);
-
+static int32 defGetMinApplyDelay(DefElem *def);
 
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
@@ -146,6 +148,8 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 		opts->disableonerr = false;
 	if (IsSet(supported_opts, SUBOPT_ORIGIN))
 		opts->origin = pstrdup(LOGICALREP_ORIGIN_ANY);
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY))
+		opts->min_apply_delay = 0;
 
 	/* Parse options */
 	foreach(lc, stmt_options)
@@ -324,6 +328,15 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 			opts->specified_opts |= SUBOPT_LSN;
 			opts->lsn = lsn;
 		}
+		else if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+				 strcmp(defel->defname, "min_apply_delay") == 0)
+		{
+			if (IsSet(opts->specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				errorConflictingDefElem(defel, pstate);
+
+			opts->specified_opts |= SUBOPT_MIN_APPLY_DELAY;
+			opts->min_apply_delay = defGetMinApplyDelay(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -404,6 +417,32 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 								"slot_name = NONE", "create_slot = false")));
 		}
 	}
+
+	/*
+	 * The combination of parallel streaming mode and min_apply_delay is not
+	 * allowed. This is because in parallel streaming mode, we start applying
+	 * the transaction stream as soon as the first change arrives without
+	 * knowing the transaction's prepare/commit time. This means we cannot
+	 * calculate the underlying network/decoding lag between publisher and
+	 * subscriber, and so always waiting for the full 'min_apply_delay' period
+	 * might include unnecessary delay.
+	 *
+	 * The other possibility was to apply the delay at the end of the parallel
+	 * apply transaction but that would cause issues related to resource bloat
+	 * and locks being held for a long time.
+	 */
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+		opts->min_apply_delay > 0 &&
+		opts->streaming == LOGICALREP_STREAM_PARALLEL)
+		ereport(ERROR,
+				errcode(ERRCODE_SYNTAX_ERROR),
+
+		/*
+		 * translator: the first %s is a string of the form "parameter > 0"
+		 * and the second one is "option = value".
+		 */
+				errmsg("%s and %s are mutually exclusive options",
+					   "min_apply_delay > 0", "streaming = parallel"));
 }
 
 /*
@@ -560,7 +599,8 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 					  SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 					  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 					  SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT |
-					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN);
+					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN |
+					  SUBOPT_MIN_APPLY_DELAY);
 	parse_subscription_options(pstate, stmt->options, supported_opts, &opts);
 
 	/*
@@ -625,6 +665,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 	values[Anum_pg_subscription_oid - 1] = ObjectIdGetDatum(subid);
 	values[Anum_pg_subscription_subdbid - 1] = ObjectIdGetDatum(MyDatabaseId);
 	values[Anum_pg_subscription_subskiplsn - 1] = LSNGetDatum(InvalidXLogRecPtr);
+	values[Anum_pg_subscription_subminapplydelay - 1] = Int32GetDatum(opts.min_apply_delay);
 	values[Anum_pg_subscription_subname - 1] =
 		DirectFunctionCall1(namein, CStringGetDatum(stmt->subname));
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
@@ -1054,7 +1095,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 				supported_opts = (SUBOPT_SLOT_NAME |
 								  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 								  SUBOPT_STREAMING | SUBOPT_DISABLE_ON_ERR |
-								  SUBOPT_ORIGIN);
+								  SUBOPT_ORIGIN | SUBOPT_MIN_APPLY_DELAY);
 
 				parse_subscription_options(pstate, stmt->options,
 										   supported_opts, &opts);
@@ -1098,6 +1139,19 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.streaming == LOGICALREP_STREAM_PARALLEL &&
+						!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)
+						&& sub->minapplydelay > 0)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set parallel streaming mode for subscription with %s",
+									   "min_apply_delay"));
+
 					values[Anum_pg_subscription_substream - 1] =
 						CharGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -1111,6 +1165,26 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 						= true;
 				}
 
+				if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.min_apply_delay > 0 &&
+						!IsSet(opts.specified_opts, SUBOPT_STREAMING)
+						&& sub->stream == LOGICALREP_STREAM_PARALLEL)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set %s for subscription in parallel streaming mode",
+									   "min_apply_delay"));
+
+					values[Anum_pg_subscription_subminapplydelay - 1] =
+						Int32GetDatum(opts.min_apply_delay);
+					replaces[Anum_pg_subscription_subminapplydelay - 1] = true;
+				}
+
 				if (IsSet(opts.specified_opts, SUBOPT_ORIGIN))
 				{
 					values[Anum_pg_subscription_suborigin - 1] =
@@ -2195,3 +2269,45 @@ defGetStreamingMode(DefElem *def)
 					def->defname)));
 	return LOGICALREP_STREAM_OFF;	/* keep compiler quiet */
 }
+
+/*
+ * Extract the min_apply_delay value from a DefElem. This is very similar to
+ * parse_and_validate_value() for integer values, because min_apply_delay
+ * accepts the same parameter format as recovery_min_apply_delay.
+ */
+static int32
+defGetMinApplyDelay(DefElem *def)
+{
+	char	   *input_string;
+	int			result;
+	const char *hintmsg;
+
+	input_string = defGetString(def);
+
+	/*
+	 * Parse the given string as a parameter whose unit is milliseconds.
+	 */
+	if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid value for parameter \"%s\": \"%s\"",
+						"min_apply_delay", input_string),
+				 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+
+	/*
+	 * Check both the lower boundary of the valid min_apply_delay range and
+	 * the upper boundary, the latter as a safeguard for platforms where
+	 * INT_MAX is larger than PG_INT32_MAX. Although parse_int() has confirmed
+	 * that the result is less than or equal to INT_MAX, the value will be
+	 * stored in a catalog column of type int32.
+	 */
+	if (result < 0 || result > PG_INT32_MAX)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)",
+						result,
+						"min_apply_delay",
+						0, PG_INT32_MAX)));
+
+	return result;
+}
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index da437e0bc3..32db20fd98 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -704,7 +704,8 @@ pa_process_spooled_messages_if_required(void)
 	{
 		apply_spooled_messages(&MyParallelShared->fileset,
 							   MyParallelShared->xid,
-							   InvalidXLogRecPtr);
+							   InvalidXLogRecPtr,
+							   0);
 		pa_set_fileset_state(MyParallelShared, FS_EMPTY);
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfb2ab6248..19b0574ad0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -319,6 +319,17 @@ static List *on_commit_wakeup_workers_subids = NIL;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/*
+ * To avoid a walsender timeout during time-delayed logical replication, the
+ * apply worker keeps sending feedback messages during the delay period.
+ * Because the delay is applied before the start of the transaction, no WAL
+ * records are written for the suspended changes during the wait. When the
+ * apply worker sends a feedback message during the delay, it must therefore
+ * not overwrite the flushed and applied LSN positions with the last received
+ * LSN. See send_feedback() for details.
+ */
+static XLogRecPtr last_received = InvalidXLogRecPtr;
+
 /* fields valid only when processing streamed transaction */
 static bool in_streamed_transaction = false;
 
@@ -389,7 +400,8 @@ static void stream_write_change(char action, StringInfo s);
 static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
 static void stream_close_file(void);
 
-static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
+static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply,
+						  bool has_unprocessed_change);
 
 static void DisableSubscriptionAndExit(void);
 
@@ -999,6 +1011,110 @@ slot_modify_data(TupleTableSlot *slot, TupleTableSlot *srcslot,
 	ExecStoreVirtualTuple(slot);
 }
 
+/*
+ * When min_apply_delay parameter is set on the subscriber, we wait long enough
+ * to make sure a transaction is applied at least that period behind the
+ * publisher.
+ *
+ * While physical replication applies the delay at commit time, this
+ * feature applies the delay before the transaction starts to be applied.
+ * This is mainly because keeping a transaction that has performed write
+ * operations open for a long time causes problems such as bloat and
+ * long-held locks.
+ *
+ * The min_apply_delay parameter will take effect only after all tables are in
+ * READY state.
+ *
+ * xid is the transaction id where we apply the delay.
+ *
+ * finish_ts is the commit/prepare time of both regular (non-streamed) and
+ * streamed transactions. Unlike the regular (non-streamed) cases, the delay
+ * is applied in a STREAM COMMIT/STREAM PREPARE message for streamed
+ * transactions. The STREAM START message does not contain a commit/prepare
+ * time (it will be available when the in-progress transaction finishes).
+ * Hence, it's not appropriate to apply a delay at the STREAM START time.
+ */
+static void
+maybe_apply_delay(TransactionId xid, TimestampTz finish_ts)
+{
+	Assert(finish_ts > 0);
+
+	/* Nothing to do if no delay set */
+	if (!MySubscription->minapplydelay)
+		return;
+
+	/*
+	 * The min_apply_delay parameter is ignored until all tablesync workers
+	 * have reached READY state. This is because if we allowed the delay
+	 * during the catchup phase, then once we reached the limit of tablesync
+	 * workers it would impose a delay for each subsequent worker. That would
+	 * cause initial table synchronization completion to take a long time.
+	 */
+	if (!AllTablesyncsReady())
+		return;
+
+	/* Apply the delay using the latch mechanism */
+	while (true)
+	{
+		TimestampTz delayUntil;
+		long		diffms;
+		long		status_interval_ms;
+
+		ResetLatch(MyLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* This might change wal_receiver_status_interval */
+		if (ConfigReloadPending)
+		{
+			ConfigReloadPending = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		/*
+		 * Before calculating the time duration, reload the catalog if needed.
+		 */
+		if (!in_remote_transaction && !in_streamed_transaction)
+		{
+			AcceptInvalidationMessages();
+			maybe_reread_subscription();
+		}
+
+		delayUntil = TimestampTzPlusMilliseconds(finish_ts, MySubscription->minapplydelay);
+		diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
+
+		/*
+		 * Exit without arming the latch if it's already past time to apply
+		 * this transaction.
+		 */
+		if (diffms <= 0)
+			break;
+
+		elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = %d ms, remaining wait time: %ld ms",
+			 xid, MySubscription->minapplydelay, diffms);
+
+		/*
+		 * Call send_feedback() to prevent the publisher from exiting by
+		 * timeout during the delay, when the status interval is greater than
+		 * zero.
+		 */
+		status_interval_ms = wal_receiver_status_interval * 1000L;
+		if (status_interval_ms > 0 && diffms > status_interval_ms)
+		{
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  status_interval_ms,
+					  WAIT_EVENT_RECOVERY_APPLY_DELAY);
+			send_feedback(last_received, true, false, true);
+		}
+		else
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  diffms,
+					  WAIT_EVENT_RECOVERY_APPLY_DELAY);
+	}
+}
+
 /*
  * Handle BEGIN message.
  */
@@ -1013,6 +1129,9 @@ apply_handle_begin(StringInfo s)
 	logicalrep_read_begin(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
 
+	/* Should we delay the current transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.committime);
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	maybe_start_skipping_changes(begin_data.final_lsn);
@@ -1070,6 +1189,9 @@ apply_handle_begin_prepare(StringInfo s)
 	logicalrep_read_begin_prepare(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.prepare_lsn);
 
+	/* Should we delay the current prepared transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.prepare_time);
+
 	remote_final_lsn = begin_data.prepare_lsn;
 
 	maybe_start_skipping_changes(begin_data.prepare_lsn);
@@ -1317,7 +1439,8 @@ apply_handle_stream_prepare(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset,
-								   prepare_data.xid, prepare_data.prepare_lsn);
+								   prepare_data.xid, prepare_data.prepare_lsn,
+								   prepare_data.prepare_time);
 
 			/* Mark the transaction as prepared. */
 			apply_handle_prepare_internal(&prepare_data);
@@ -2011,10 +2134,13 @@ ensure_last_message(FileSet *stream_fileset, TransactionId xid, int fileno,
 
 /*
  * Common spoolfile processing.
+ *
+ * The commit/prepare time (finish_ts) is required for time-delayed logical
+ * replication.
  */
 void
 apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-					   XLogRecPtr lsn)
+					   XLogRecPtr lsn, TimestampTz finish_ts)
 {
 	StringInfoData s2;
 	int			nchanges;
@@ -2025,6 +2151,10 @@ apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
 	int			fileno;
 	off_t		offset;
 
+	/* Should we delay the current transaction? */
+	if (finish_ts)
+		maybe_apply_delay(xid, finish_ts);
+
 	if (!am_parallel_apply_worker())
 		maybe_start_skipping_changes(lsn);
 
@@ -2174,7 +2304,7 @@ apply_handle_stream_commit(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset, xid,
-								   commit_data.commit_lsn);
+								   commit_data.commit_lsn, commit_data.committime);
 
 			apply_handle_commit_internal(&commit_data);
 
@@ -3447,7 +3577,7 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
  * Apply main loop.
  */
 static void
-LogicalRepApplyLoop(XLogRecPtr last_received)
+LogicalRepApplyLoop(void)
 {
 	TimestampTz last_recv_timestamp = GetCurrentTimestamp();
 	bool		ping_sent = false;
@@ -3568,7 +3698,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						if (last_received < end_lsn)
 							last_received = end_lsn;
 
-						send_feedback(last_received, reply_requested, false);
+						send_feedback(last_received, reply_requested, false, false);
 						UpdateWorkerStats(last_received, timestamp, true);
 					}
 					/* other message types are purposefully ignored */
@@ -3581,7 +3711,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		}
 
 		/* confirm all writes so far */
-		send_feedback(last_received, false, false);
+		send_feedback(last_received, false, false, false);
 
 		if (!in_remote_transaction && !in_streamed_transaction)
 		{
@@ -3678,7 +3808,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 				}
 			}
 
-			send_feedback(last_received, requestReply, requestReply);
+			send_feedback(last_received, requestReply, requestReply, false);
 
 			/*
 			 * Force reporting to ensure long idle periods don't lead to
@@ -3708,7 +3838,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
  * to send a response to avoid timeouts.
  */
 static void
-send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
+send_feedback(XLogRecPtr recvpos, bool force, bool requestReply, bool has_unprocessed_change)
 {
 	static StringInfo reply_message = NULL;
 	static TimestampTz send_time = 0;
@@ -3738,8 +3868,14 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	/*
 	 * No outstanding transactions to flush, we can report the latest received
 	 * position. This is important for synchronous replication.
+	 *
+	 * If the logical replication subscription has unprocessed changes, do
+	 * not inform the publisher that the latest received LSN has already been
+	 * applied and flushed; otherwise, the publisher will make a wrong
+	 * assumption about the logical replication progress. Instead, just send a
+	 * feedback message to avoid a replication timeout during the delay.
 	 */
-	if (!have_pending_txes)
+	if (!have_pending_txes && !has_unprocessed_change)
 		flushpos = writepos = recvpos;
 
 	if (writepos < last_writepos)
@@ -3776,8 +3912,9 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
 
-	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
+	elog(DEBUG2, "sending feedback (force %d, has_unprocessed_change %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
+		 has_unprocessed_change,
 		 LSN_FORMAT_ARGS(recvpos),
 		 LSN_FORMAT_ARGS(writepos),
 		 LSN_FORMAT_ARGS(flushpos));
@@ -4367,11 +4504,11 @@ start_table_sync(XLogRecPtr *origin_startpos, char **myslotname)
  * of system resource error and are not repeatable.
  */
 static void
-start_apply(XLogRecPtr origin_startpos)
+start_apply(void)
 {
 	PG_TRY();
 	{
-		LogicalRepApplyLoop(origin_startpos);
+		LogicalRepApplyLoop();
 	}
 	PG_CATCH();
 	{
@@ -4661,7 +4798,8 @@ ApplyWorkerMain(Datum main_arg)
 	}
 
 	/* Run the main loop. */
-	start_apply(origin_startpos);
+	last_received = origin_startpos;
+	start_apply();
 
 	proc_exit(0);
 }
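
(Reader's note, not part of the patch.) The wait arithmetic in maybe_apply_delay() above can be sketched as a small standalone C program. This is only an illustration under stated assumptions: remaining_delay_ms(), next_sleep_ms() and simulate_wait() are hypothetical names, timestamps are plain millisecond counters rather than TimestampTz, and the clock is simulated so that every sleep runs to completion (in the real loop the latch can wake the worker early, and the remaining time is then simply recomputed).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch. All times are in milliseconds.
 * Returns how long a transaction that finished at finish_ts must still wait,
 * or 0 if the requested delay has already elapsed.
 */
static int64_t
remaining_delay_ms(int64_t finish_ts, int32_t min_apply_delay, int64_t now)
{
	int64_t		delay_until;

	assert(min_apply_delay >= 0);
	delay_until = finish_ts + (int64_t) min_apply_delay;
	return (delay_until > now) ? delay_until - now : 0;
}

/*
 * Length of the next sleep: one status interval if feedback is enabled and
 * more than one interval remains, otherwise the whole remaining time.
 */
static int64_t
next_sleep_ms(int64_t remaining, int64_t status_interval)
{
	if (status_interval > 0 && remaining > status_interval)
		return status_interval;
	return remaining;
}

/*
 * Run the wait loop against a simulated clock. Returns the number of
 * feedback messages that would be sent and stores the time at which the
 * loop exits in *applied_at.
 */
static int
simulate_wait(int64_t finish_ts, int32_t min_apply_delay, int64_t now,
			  int64_t status_interval, int64_t *applied_at)
{
	int			feedbacks = 0;

	for (;;)
	{
		int64_t		remaining = remaining_delay_ms(finish_ts, min_apply_delay, now);
		int64_t		sleep_ms;

		if (remaining <= 0)
			break;

		sleep_ms = next_sleep_ms(remaining, status_interval);

		/* Feedback is sent only after a sleep shortened to the interval. */
		if (sleep_ms < remaining)
			feedbacks++;

		now += sleep_ms;		/* simulated clock: the sleep always completes */
	}

	*applied_at = now;
	return feedbacks;
}
```

For example, with a finish time of 1000, a delay of 25000, a current time of 4000 and a status interval of 10000, the simulated loop sleeps 10000, 10000 and 2000 ms, sends feedback after each of the two shortened sleeps, and exits at time 26000.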
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 527c7651ab..1e87f0124e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4494,6 +4494,7 @@ getSubscriptions(Archive *fout)
 	int			i_subsynccommit;
 	int			i_subpublications;
 	int			i_subbinary;
+	int			i_subminapplydelay;
 	int			i,
 				ntups;
 
@@ -4546,9 +4547,13 @@ getSubscriptions(Archive *fout)
 						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	if (fout->remoteVersion >= 160000)
-		appendPQExpBufferStr(query, " s.suborigin\n");
+		appendPQExpBufferStr(query,
+							 " s.suborigin,\n"
+							 " s.subminapplydelay\n");
 	else
-		appendPQExpBuffer(query, " '%s' AS suborigin\n", LOGICALREP_ORIGIN_ANY);
+		appendPQExpBuffer(query, " '%s' AS suborigin,\n"
+						  " 0 AS subminapplydelay\n",
+						  LOGICALREP_ORIGIN_ANY);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4576,6 +4581,7 @@ getSubscriptions(Archive *fout)
 	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 	i_subdisableonerr = PQfnumber(res, "subdisableonerr");
 	i_suborigin = PQfnumber(res, "suborigin");
+	i_subminapplydelay = PQfnumber(res, "subminapplydelay");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4606,6 +4612,8 @@ getSubscriptions(Archive *fout)
 		subinfo[i].subdisableonerr =
 			pg_strdup(PQgetvalue(res, i, i_subdisableonerr));
 		subinfo[i].suborigin = pg_strdup(PQgetvalue(res, i, i_suborigin));
+		subinfo[i].subminapplydelay =
+			atoi(PQgetvalue(res, i, i_subminapplydelay));
 
 		/* Decide whether we want to dump it */
 		selectDumpableObject(&(subinfo[i].dobj), fout);
@@ -4687,6 +4695,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
+	if (subinfo->subminapplydelay > 0)
+		appendPQExpBuffer(query, ", min_apply_delay = '%d ms'", subinfo->subminapplydelay);
+
 	appendPQExpBufferStr(query, ");\n");
 
 	if (subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION)
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index e7cbd8d7ed..b8831c3ed3 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -661,6 +661,7 @@ typedef struct _SubscriptionInfo
 	char	   *subdisableonerr;
 	char	   *suborigin;
 	char	   *subsynccommit;
+	int			subminapplydelay;
 	char	   *subpublications;
 } SubscriptionInfo;
 
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index c8a0bb7b3a..81d4607a1c 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6472,7 +6472,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false, false, false, false, false};
+	false, false, false, false, false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6527,10 +6527,13 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Two-phase commit"),
 							  gettext_noop("Disable on error"));
 
+		/* Origin and min_apply_delay are only supported in v16 and higher */
 		if (pset.sversion >= 160000)
 			appendPQExpBuffer(&buf,
-							  ", suborigin AS \"%s\"\n",
-							  gettext_noop("Origin"));
+							  ", suborigin AS \"%s\"\n"
+							  ", subminapplydelay AS \"%s\"\n",
+							  gettext_noop("Origin"),
+							  gettext_noop("Min apply delay"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5e1882eaea..e8b9a43a47 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1925,7 +1925,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("(", "PUBLICATION");
 	/* ALTER SUBSCRIPTION <name> SET ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SET", "("))
-		COMPLETE_WITH("binary", "disable_on_error", "origin", "slot_name",
+		COMPLETE_WITH("binary", "disable_on_error", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit");
 	/* ALTER SUBSCRIPTION <name> SKIP ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SKIP", "("))
@@ -3268,7 +3268,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
-					  "disable_on_error", "enabled", "origin", "slot_name",
+					  "disable_on_error", "enabled", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index b0f2a1705d..d1cfefc6d6 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -74,6 +74,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	Oid			subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */
 
+	int32		subminapplydelay;	/* Replication apply delay (ms) */
+
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
@@ -122,6 +124,7 @@ typedef struct Subscription
 								 * skipped */
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
+	int32		minapplydelay;	/* Replication apply delay (ms) */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index dc87a4edd1..3dc09d1a4c 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -255,7 +255,7 @@ extern void stream_stop_internal(TransactionId xid);
 
 /* Common streaming function to apply all the spooled messages */
 extern void apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-								   XLogRecPtr lsn);
+								   XLogRecPtr lsn, TimestampTz finish_ts);
 
 extern void apply_dispatch(StringInfo s);
 
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 3f99b14394..cf8e727ee9 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -114,18 +114,18 @@ CREATE SUBSCRIPTION regress_testsub4 CONNECTION 'dbname=regress_doesnotexist' PU
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub4 SET (origin = any);
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub3;
@@ -143,10 +143,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -163,10 +163,10 @@ ERROR:  unrecognized subscription parameter: "create_slot"
 -- ok
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/12345');
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/12345
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/12345
 (1 row)
 
 -- ok - with lsn = NONE
@@ -175,10 +175,10 @@ ALTER SUBSCRIPTION regress_testsub SKIP (lsn = NONE);
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/0');
 ERROR:  invalid WAL location (LSN): 0/0
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/0
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 BEGIN;
@@ -210,10 +210,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                                               List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
----------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | local              | dbname=regress_doesnotexist2 | 0/0
+                                                                                                        List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | local              | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 -- rename back to keep the rest simple
@@ -247,19 +247,19 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -271,27 +271,27 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication already exists
@@ -306,10 +306,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                                                 List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication used more than once
@@ -324,10 +324,10 @@ ERROR:  publication "testpub3" is not in subscription "regress_testsub"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -363,10 +363,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 --fail - alter of two_phase option not supported.
@@ -375,10 +375,10 @@ ERROR:  unrecognized subscription parameter: "two_phase"
 -- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -388,10 +388,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -404,20 +404,57 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+ERROR:  invalid value for parameter "min_apply_delay": "foo"
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+ERROR:  -1 ms is outside the valid range for parameter "min_apply_delay" (0 .. 2147483647)
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+ERROR:  min_apply_delay > 0 and streaming = parallel are mutually exclusive options
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+WARNING:  subscription was created, but is not connected
+HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |             123 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |        86400000 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+ERROR:  cannot set parallel streaming mode for subscription with min_apply_delay
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ERROR:  cannot set min_apply_delay for subscription in parallel streaming mode
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 7281f5fee2..7317b140f5 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -286,6 +286,30 @@ ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+\dRs+
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 91aa068c95..75fd77b891 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -515,6 +515,34 @@ $node_publisher->poll_query_until('postgres',
   or die
   "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
 
+# Test time-delayed logical replication
+#
+# If the subscription sets min_apply_delay parameter, the logical replication
+# worker will delay the transaction apply for min_apply_delay milliseconds. We
+# verify this by looking at the time difference between a) when tuples are
+# inserted on the publisher, and b) when those changes are replicated on the
+# subscriber. Even on slow machines, this strategy will give predictable behavior.
+
+# Set min_apply_delay parameter to 3 seconds
+my $delay = 3;
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+
+# Before doing the insertion, get the current timestamp that will be
+# used as a comparison base.
+my $publisher_insert_time = time();
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_ins VALUES (generate_series(1101, 1120))");
+
+# The publisher waits for the replication to complete
+$node_publisher->wait_for_catchup('tap_sub_renamed');
+
+# This test is successful if and only if the LSN has been applied with at least
+# the configured apply delay.
+ok( time() - $publisher_insert_time >= $delay,
+	"subscriber applies WAL only after replication delay for non-streaming transaction"
+);
+
 # check all the cleanup
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
 
-- 
2.27.0

v6-0002-Extend-START_REPLICATION-command-to-accept-walsen.patch (application/octet-stream)
From 3dec722f7a6ae063af55e71609f6429b3f19d675 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 8 Feb 2023 09:09:31 +0000
Subject: [PATCH v6 2/2] Extend START_REPLICATION command to accept walsender
 options

This commit extends START_REPLICATION to accept SHUTDOWN_MODE term. Currently,
it works well only for logical replication.

When 'wait_flush', which is the default, is specified, the walsender will wait
for all the sent WALs to be flushed on the subscriber side, before exiting the
process. 'immediate' will exit without confirming the remote flush. This may
break the consistency between publisher and subscriber, but it may be useful
for a system that has a high-latency network to reduce the amount of time for
shutdown. This may be useful to shut down the publisher even when the
worker is stuck.

Author: Hayato Kuroda
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 doc/src/sgml/protocol.sgml                    | 15 ++++-
 .../libpqwalreceiver/libpqwalreceiver.c       |  7 ++
 src/backend/replication/logical/worker.c      | 17 ++++-
 src/backend/replication/repl_gram.y           | 12 +++-
 src/backend/replication/repl_scanner.l        |  1 +
 src/backend/replication/walreceiver.c         |  1 +
 src/backend/replication/walsender.c           | 67 ++++++++++++++++++-
 src/include/nodes/replnodes.h                 |  1 +
 src/include/replication/walreceiver.h         |  1 +
 src/test/subscription/t/001_rep_changes.pl    |  7 +-
 src/tools/pgindent/typedefs.list              |  1 +
 11 files changed, 122 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 93fc7167d4..43e71421de 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2500,7 +2500,7 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
     </varlistentry>
 
     <varlistentry id="protocol-replication-start-replication-slot-logical">
-     <term><literal>START_REPLICATION</literal> <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> <literal>LOGICAL</literal> <replaceable class="parameter">XXX/XXX</replaceable> [ ( <replaceable>option_name</replaceable> [ <replaceable>option_value</replaceable> ] [, ...] ) ]</term>
+     <term><literal>START_REPLICATION</literal> <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> <literal>LOGICAL</literal> <replaceable class="parameter">XXX/XXX</replaceable> [ <literal>SHUTDOWN_MODE</literal> <replaceable class="parameter">shutdown_mode</replaceable> ] [ ( <replaceable>option_name</replaceable> [ <replaceable>option_value</replaceable> ] [, ...] ) ]</term>
      <listitem>
       <para>
        Instructs server to start streaming WAL for logical replication,
@@ -2555,6 +2555,19 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
         </listitem>
        </varlistentry>
 
+       <varlistentry>
+        <term><literal>SHUTDOWN_MODE { 'wait_flush' | 'immediate' }</literal></term>
+        <listitem>
+         <para>
+          Decides the behavior of the walsender process at shutdown. If the
+          shutdown mode is <literal>'wait_flush'</literal>, which is the
+          default, the walsender waits for all the sent WALs to be flushed
+          on the subscriber side. If it is <literal>'immediate'</literal>,
+          the walsender exits without confirming the remote flush.
+         </para>
+        </listitem>
+       </varlistentry>
+
        <varlistentry>
         <term><replaceable class="parameter">option_name</replaceable></term>
         <listitem>
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 560ec974fa..18f6e09cfd 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -403,6 +403,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		List	   *pubnames;
 		char	   *pubnames_literal;
 
+		/* Add SHUTDOWN_MODE option if needed */
+		if (options->shutdown_mode &&
+			PQserverVersion(conn->streamConn) >= 160000)
+			appendStringInfo(&cmd, " SHUTDOWN_MODE '%s'",
+							 options->shutdown_mode);
+
 		appendStringInfoString(&cmd, " (");
 
 		appendStringInfo(&cmd, "proto_version '%u'",
@@ -449,6 +455,7 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, " TIMELINE %u",
 						 options->proto.physical.startpointTLI);
 
+
 	/* Start streaming. */
 	res = libpqrcv_PQexec(conn->streamConn, cmd.data);
 	pfree(cmd.data);
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 19b0574ad0..967252fdbb 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4023,10 +4023,15 @@ maybe_reread_subscription(void)
 
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
+	 *
 	 * The launcher will start a new worker but note that the parallel apply
 	 * worker won't restart if the streaming option's value is changed from
 	 * 'parallel' to any other value or the server decides not to stream the
 	 * in-progress transaction.
+	 *
+	 * minapplydelay affects SHUTDOWN_MODE option. 'immediate' shutdown mode
+	 * will be specified if it is set to non-zero, otherwise default mode will
+	 * be set.
 	 */
 	if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
 		strcmp(newsub->name, MySubscription->name) != 0 ||
@@ -4035,7 +4040,8 @@ maybe_reread_subscription(void)
 		newsub->stream != MySubscription->stream ||
 		strcmp(newsub->origin, MySubscription->origin) != 0 ||
 		newsub->owner != MySubscription->owner ||
-		!equal(newsub->publications, MySubscription->publications))
+		!equal(newsub->publications, MySubscription->publications) ||
+		(newsub->minapplydelay == 0) != (MySubscription->minapplydelay == 0))
 	{
 		if (am_parallel_apply_worker())
 			ereport(LOG,
@@ -4719,6 +4725,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	options.shutdown_mode = NULL;
 
 	server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
 	options.proto.logical.proto_version =
@@ -4757,6 +4764,14 @@ ApplyWorkerMain(Datum main_arg)
 
 	if (!am_tablesync_worker())
 	{
+		/*
+		 * time-delayed logical replication does not support tablesync
+		 * workers, so only the leader apply worker can request walsenders to
+		 * exit before confirming remote flush.
+		 */
+		if (server_version >= 160000 && MySubscription->minapplydelay > 0)
+			options.shutdown_mode = pstrdup("immediate");
+
 		/*
 		 * Even when the two_phase mode is requested by the user, it remains
 		 * as the tri-state PENDING until all tablesyncs have reached READY
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..54450a041a 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,6 +76,7 @@ Node *replication_parse_result;
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
+%token K_SHUTDOWN_MODE
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -91,6 +92,7 @@ Node *replication_parse_result;
 %type <boolval>	opt_temporary
 %type <list>	create_slot_options create_slot_legacy_opt_list
 %type <defelt>	create_slot_legacy_opt
+%type <str>	opt_shutdown_mode
 
 %%
 
@@ -270,20 +272,22 @@ start_replication:
 					cmd->slotname = $2;
 					cmd->startpoint = $4;
 					cmd->timeline = $5;
+					cmd->shutdownmode = NULL;
 					$$ = (Node *) cmd;
 				}
 			;
 
 /* START_REPLICATION SLOT slot LOGICAL %X/%X options */
 start_logical_replication:
-			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options
+			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR opt_shutdown_mode plugin_options
 				{
 					StartReplicationCmd *cmd;
 					cmd = makeNode(StartReplicationCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $3;
 					cmd->startpoint = $5;
-					cmd->options = $6;
+					cmd->shutdownmode = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -336,6 +340,10 @@ opt_timeline:
 				| /* EMPTY */			{ $$ = 0; }
 			;
 
+opt_shutdown_mode:
+			K_SHUTDOWN_MODE SCONST			{ $$ = $2; }
+			| /* EMPTY */					{ $$ = NULL; }
+		;
 
 plugin_options:
 			'(' plugin_opt_list ')'			{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index cb467ca46f..fcc6f6feda 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
 WAIT				{ return K_WAIT; }
+SHUTDOWN_MODE		{ return K_SHUTDOWN_MODE; }
 
 {space}+		{ /* do nothing */ }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f6446da2d6..cfce9d93ef 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -409,6 +409,7 @@ WalReceiverMain(void)
 		options.logical = false;
 		options.startpoint = startpoint;
 		options.slotname = slotname[0] != '\0' ? slotname : NULL;
+		options.shutdown_mode = NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		if (walrcv_startstreaming(wrconn, &options))
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 75e8363e24..d169092bf7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -219,6 +219,15 @@ typedef struct
 
 static LagTracker *lag_tracker;
 
+/* Indicator for specifying the shutdown mode */
+typedef enum
+{
+	WALSND_SHUTDOWN_MODE_WAIT_FLUSH = 0,
+	WALSND_SHUTDOWN_MODE_IMMIDEATE
+} WalSndShutdownMode;
+
+static WalSndShutdownMode shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+
 /* Signal handlers */
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
@@ -260,6 +269,8 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
 							  TimeLineID *tli_p);
 
+static void CheckWalSndOptions(const StartReplicationCmd *cmd);
+static void ParseShutdownMode(char *shutdownmode);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -1272,6 +1283,9 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 		got_STOPPING = true;
 	}
 
+	/* Check given options and set flags accordingly */
+	CheckWalSndOptions(cmd);
+
 	/*
 	 * Create our decoding context, making it start at the previously ack'ed
 	 * position.
@@ -1450,6 +1464,16 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		/*
+		 * In this function, there is a possibility that the walsender is
+		 * stuck. It is caused when the opposite worker is stuck and then the
+		 * send-buffer of the walsender becomes full. Therefore, we must add
+		 * an additional path for shutdown for immediate shutdown mode.
+		 */
+		if (shutdown_mode == WALSND_SHUTDOWN_MODE_IMMIDEATE &&
+			got_STOPPING)
+			WalSndDone(XLogSendLogical);
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
@@ -3114,19 +3138,25 @@ WalSndDone(WalSndSendDataCallback send_data)
 	 * To figure out whether all WAL has successfully been replicated, check
 	 * flush location if valid, write otherwise. Tools like pg_receivewal will
 	 * usually (unless in synchronous mode) return an invalid flush location.
+	 *
+	 * If we are in the immediate shutdown mode, flush location and output
+	 * buffer is not checked. This may break the consistency between nodes,
+	 * but it may be useful for the system that has high-latency network to
+	 * reduce the amount of time for shutdown.
 	 */
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	if (WalSndCaughtUp &&
+		(shutdown_mode == WALSND_SHUTDOWN_MODE_IMMIDEATE ||
+		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
 	{
 		QueryCompletion qc;
 
 		/* Inform the standby that XLOG streaming is done */
 		SetQueryCompletion(&qc, CMDTAG_COPY, 0);
 		EndCommand(&qc, DestRemote, false);
-		pq_flush();
+		pq_flush_if_writable();
 
 		proc_exit(0);
 	}
@@ -3849,3 +3879,34 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+/*
+ * Check options for walsender itself and set flags accordingly.
+ *
+ * Currently only one option is accepted.
+ */
+static void
+CheckWalSndOptions(const StartReplicationCmd *cmd)
+{
+	if (cmd->shutdownmode)
+		ParseShutdownMode(cmd->shutdownmode);
+}
+
+/*
+ * Parse given shutdown mode.
+ *
+ * Currently two values are accepted - "wait_flush" and "immediate"
+ */
+static void
+ParseShutdownMode(char *shutdownmode)
+{
+	if (pg_strcasecmp(shutdownmode, "wait_flush") == 0)
+		shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+	else if (pg_strcasecmp(shutdownmode, "immediate") == 0)
+		shutdown_mode = WALSND_SHUTDOWN_MODE_IMMIDEATE;
+	else
+		ereport(ERROR,
+				errcode(ERRCODE_SYNTAX_ERROR),
+				errmsg("invalid value for shutdown mode: \"%s\"", shutdownmode),
+				errhint("Available values: wait_flush, immediate."));
+}
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..c96e85e859 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -83,6 +83,7 @@ typedef struct StartReplicationCmd
 	char	   *slotname;
 	TimeLineID	timeline;
 	XLogRecPtr	startpoint;
+	char	   *shutdownmode;
 	List	   *options;
 } StartReplicationCmd;
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index decffe352d..ef6297da52 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -170,6 +170,7 @@ typedef struct
 								 * false if physical stream.  */
 	char	   *slotname;		/* Name of the replication slot or NULL. */
 	XLogRecPtr	startpoint;		/* LSN of starting point. */
+	char	   *shutdown_mode;	/* Name of specified shutdown name */
 
 	union
 	{
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 75fd77b891..0aae5f5dd2 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -523,10 +523,15 @@ $node_publisher->poll_query_until('postgres',
 # inserted on the publisher, and b) when those changes are replicated on the
 # subscriber. Even on slow machines, this strategy will give predictable behavior.
 
-# Set min_apply_delay parameter to 3 seconds
+# Check restart on changing min_apply_delay to 3 seconds
 my $delay = 3;
 $node_subscriber->safe_psql('postgres',
 	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+$node_publisher->poll_query_until('postgres',
+	"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+  )
+  or die
+  "Timed out while waiting for the walsender to restart after changing min_apply_delay to non-zero value";
 
 # Before doing the insertion, get the current timestamp that will be
 # used as a comparison base.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 36d1dc0117..d06a7868ca 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2976,6 +2976,7 @@ WalReceiverFunctionsType
 WalSnd
 WalSndCtlData
 WalSndSendDataCallback
+WalSndShutdownMode
 WalSndState
 WalTimeSample
 WalUsage
-- 
2.27.0

#51Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Masahiko Sawada (#49)
RE: Exit walsender before confirming remote flush in logical replication

Dear Sawada-san,

Thank you for reviewing!

+/*
+ * Options for controlling the behavior of the walsender. Options can be
+ * specified in the START_STREAMING replication command. Currently only one
+ * option is allowed.
+ */
+typedef struct
+{
+        WalSndShutdownMode shutdown_mode;
+} WalSndOptions;
+
+static WalSndOptions *my_options = NULL;

I'm not sure we need to have it as a struct at this stage since we
support only one option. I wonder if we can have one value, say
shutdown_mode, and we can make it a struct when we really need it.
Even if we use WalSndOptions struct, I don't think we need to
dynamically allocate it. Since a walsender can start logical
replication multiple times in principle, my_options is not freed.

+1, removed the structure.

---
+/*
+ * Parse given shutdown mode.
+ *
+ * Currently two values are accepted - "wait_flush" and "immediate"
+ */
+static void
+ParseShutdownMode(char *shutdownmode)
+{
+        if (pg_strcasecmp(shutdownmode, "wait_flush") == 0)
+                my_options->shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+        else if (pg_strcasecmp(shutdownmode, "immediate") == 0)
+                my_options->shutdown_mode = WALSND_SHUTDOWN_MODE_IMMIDEATE;
+        else
+                ereport(ERROR,
+                                errcode(ERRCODE_SYNTAX_ERROR),
+                                errmsg("SHUTDOWN_MODE requires \"wait_flush\" or \"immediate\""));
+}

I think we should make the error message consistent with other enum
parameters. How about the message like:

ERROR: invalid value shutdown mode: "%s"

Modified to match the messages used for enum parameters, and a hint message was also added.

New patch is attached in [1].

[1]: /messages/by-id/TYAPR01MB586683FC450662990E356A0EF5D99@TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#52Peter Smith
smithpb2250@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#50)
Re: Exit walsender before confirming remote flush in logical replication

Here are my review comments for the v6-0002 patch.

======
Commit Message

1.
This commit extends START_REPLICATION to accept SHUTDOWN_MODE term. Currently,
it works well only for logical replication.

~

1a.
"to accept SHUTDOWN_MODE term" --> "to include a SHUTDOWN_MODE clause."

~

1b.
"it works well only for..." --> do you mean "it is currently
implemented only for..."

~~~

2.
When 'wait_flush', which is the default, is specified, the walsender will wait
for all the sent WALs to be flushed on the subscriber side, before exiting the
process. 'immediate' will exit without confirming the remote flush. This may
break the consistency between publisher and subscriber, but it may be useful
for a system that has a high-latency network to reduce the amount of time for
shutdown. This may be useful to shut down the publisher even when the
worker is stuck.

~

SUGGESTION
The shutdown modes are:

1) 'wait_flush' (the default). In this mode, the walsender will wait
for all the sent WALs to be flushed on the subscriber side, before
exiting the process.

2) 'immediate'. In this mode, the walsender will exit without
confirming the remote flush. This may break the consistency between
publisher and subscriber. This mode might be useful for a system that
has a high-latency network (to reduce the amount of time for
shutdown), or to allow the shutdown of the publisher even when the
worker is stuck.
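
To make the difference between the two modes concrete, here is a minimal sketch of the exit condition, written as a stand-alone function. The names (ShutdownMode, may_exit) are hypothetical and are not taken from the patch; the logic only mirrors my reading of the modified condition in WalSndDone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the walsender state; not PostgreSQL types. */
typedef enum
{
    SHUTDOWN_WAIT_FLUSH,
    SHUTDOWN_IMMEDIATE
} ShutdownMode;

/*
 * Returns true when the sender may exit. It must have caught up in both
 * modes. In wait_flush mode the receiver must also have confirmed
 * everything that was sent, and the send buffer must be empty.
 */
static bool
may_exit(ShutdownMode mode, bool caught_up,
         uint64_t sent_ptr, uint64_t replicated_ptr, bool send_pending)
{
    if (!caught_up)
        return false;
    if (mode == SHUTDOWN_IMMEDIATE)
        return true;            /* skip the remote-flush confirmation */
    return sent_ptr == replicated_ptr && !send_pending;
}
```

With a lagging receiver (sent_ptr ahead of replicated_ptr), may_exit() is true only in the 'immediate' mode, which is the trade-off the commit message should spell out.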

======
doc/src/sgml/protocol.sgml

3.
+       <varlistentry>
+        <term><literal>SHUTDOWN_MODE { 'wait_flush' | 'immediate' }</literal></term>
+        <listitem>
+         <para>
+          Decides the behavior of the walsender process at shutdown. If the
+          shutdown mode is <literal>'wait_flush'</literal>, which is the
+          default, the walsender waits for all the sent WALs to be flushed
+          on the subscriber side. If it is <literal>'immediate'</literal>,
+          the walsender exits without confirming the remote flush.
+         </para>
+        </listitem>
+       </varlistentry>

The synopsis said:
[ SHUTDOWN_MODE shutdown_mode ]

But then the 'shutdown_mode' term was never mentioned again (??).
Instead it says:
SHUTDOWN_MODE { 'wait_flush' | 'immediate' }

IMO the detailed explanation should not say SHUTDOWN_MODE again. It
should be written more like this:

SUGGESTION
shutdown_mode

Determines the behavior of the walsender process at shutdown. If
shutdown_mode is 'wait_flush', the walsender waits for all the sent
WALs to be flushed on the subscriber side. This is the default when
SHUTDOWN_MODE is not specified.

If shutdown_mode is 'immediate', the walsender exits without
confirming the remote flush.

======
.../libpqwalreceiver/libpqwalreceiver.c

4.
+ /* Add SHUTDOWN_MODE option if needed */
+ if (options->shutdown_mode &&
+ PQserverVersion(conn->streamConn) >= 160000)
+ appendStringInfo(&cmd, " SHUTDOWN_MODE '%s'",
+ options->shutdown_mode);

Maybe you can expand on the meaning of "if needed".

SUGGESTION
Add SHUTDOWN_MODE clause if needed (i.e. if not using the default shutdown_mode)
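
For illustration, the "if needed" rule can be sketched outside of libpq as below. build_start_cmd is a hypothetical helper, not the libpqwalreceiver code, and it uses snprintf instead of StringInfo; the point is only that the clause is emitted when a mode was requested and the server is version 16 or later.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: writes the beginning of a START_REPLICATION command
 * into buf. The SHUTDOWN_MODE clause is added only when shutdown_mode is
 * not NULL and the server version is at least 160000. Returns the value
 * returned by snprintf().
 */
static int
build_start_cmd(char *buf, size_t buflen, const char *slot,
                const char *start_lsn, const char *shutdown_mode,
                int server_version)
{
    if (shutdown_mode != NULL && server_version >= 160000)
        return snprintf(buf, buflen,
                        "START_REPLICATION SLOT \"%s\" LOGICAL %s SHUTDOWN_MODE '%s' (",
                        slot, start_lsn, shutdown_mode);

    /* Default mode or older server: leave the clause out entirely. */
    return snprintf(buf, buflen,
                    "START_REPLICATION SLOT \"%s\" LOGICAL %s (",
                    slot, start_lsn);
}
```

Note that this sketch does not escape the mode string; that is acceptable here only because the caller passes a fixed literal.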

======
src/backend/replication/logical/worker.c

5. maybe_reread_subscription

+ *
+ * minapplydelay affects SHUTDOWN_MODE option. 'immediate' shutdown mode
+ * will be specified if it is set to non-zero, otherwise default mode will
+ * be set.

I reworded this comment slightly and added a reference to ApplyWorkerMain.

SUGGESTION
Time-delayed logical replication affects the SHUTDOWN_MODE clause. The
'immediate' shutdown mode will be specified if min_apply_delay is
non-zero, otherwise the default shutdown mode will be used. See
ApplyWorkerMain.

~~~

6. ApplyWorkerMain
+ /*
+ * time-delayed logical replication does not support tablesync
+ * workers, so only the leader apply worker can request walsenders to
+ * exit before confirming remote flush.
+ */

"time-delayed" --> "Time-delayed"

======
src/backend/replication/repl_gram.y

7.
@@ -91,6 +92,7 @@ Node *replication_parse_result;
%type <boolval> opt_temporary
%type <list> create_slot_options create_slot_legacy_opt_list
%type <defelt> create_slot_legacy_opt
+%type <str> opt_shutdown_mode

The tab alignment seemed not quite right. Not 100% sure.

~~~

8.
@@ -270,20 +272,22 @@ start_replication:
cmd->slotname = $2;
cmd->startpoint = $4;
cmd->timeline = $5;
+ cmd->shutdownmode = NULL;
$$ = (Node *) cmd;
}

It seemed a bit inconsistent. E.g. the cmd->options member was not set
for physical replication, so why then set this member?

Alternatively, maybe you should set cmd->options = NULL here as well?

======
src/backend/replication/walsender.c

9.
+/* Indicator for specifying the shutdown mode */
+typedef enum
+{
+ WALSND_SHUTDOWN_MODE_WAIT_FLUSH = 0,
+ WALSND_SHUTDOWN_MODE_IMMIDEATE
+} WalSndShutdownMode;

~

9a.
"Indicator for specifying" (??). How about just saying: "Shutdown modes"

~

9b.
Typo: WALSND_SHUTDOWN_MODE_IMMIDEATE ==> WALSND_SHUTDOWN_MODE_IMMEDIATE

~

9c.
AFAICT the fact that the first enum value is assigned 0 is not really
of importance. If that is correct, then IMO maybe it's better to
remove the "= 0" because the explicit assignment made me expect that
it had special meaning, and then it was confusing when I could not
find a reason.
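
As a small illustration of why the "= 0" is redundant: in C the first enumerator is 0 by default, and a file-scope variable without an initializer is zero-initialized. The names below are hypothetical and exist only for this sketch.

```c
/* Illustration only: the first enumerator is 0 even without "= 0". */
typedef enum
{
    DEMO_MODE_WAIT_FLUSH,       /* implicitly 0 */
    DEMO_MODE_IMMEDIATE         /* implicitly 1 */
} DemoShutdownMode;

/* File-scope variable without an initializer: zero-initialized by C. */
static DemoShutdownMode demo_mode;

/* Returns non-zero when the untouched variable holds the first value. */
static int
demo_default_is_wait_flush(void)
{
    return demo_mode == DEMO_MODE_WAIT_FLUSH;
}
```

Since the patch initializes shutdown_mode explicitly anyway, as far as I can see nothing depends on the numeric value.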

~~~

10. ProcessPendingWrites

+ /*
+ * In this function, there is a possibility that the walsender is
+ * stuck. It is caused when the opposite worker is stuck and then the
+ * send-buffer of the walsender becomes full. Therefore, we must add
+ * an additional path for shutdown for immediate shutdown mode.
+ */
+ if (shutdown_mode == WALSND_SHUTDOWN_MODE_IMMIDEATE &&
+ got_STOPPING)
+ WalSndDone(XLogSendLogical);

10a.
Can this comment say something like "receiving worker" instead of
"opposite worker"?

SUGGESTION
This can happen when the receiving worker is stuck, and then the
send-buffer of the walsender...

~

10b.
IMO it makes more sense to check this the other way around, i.e. we
don't care what the shutdown_mode value is unless got_STOPPING is
true.

SUGGESTION
if (got_STOPPING && (shutdown_mode == WALSND_SHUTDOWN_MODE_IMMEDIATE))

~~~

11. WalSndDone

+ * If we are in the immediate shutdown mode, flush location and output
+ * buffer is not checked. This may break the consistency between nodes,
+ * but it may be useful for the system that has high-latency network to
+ * reduce the amount of time for shutdown.

Add some quotes for the mode.

SUGGESTION
'immediate' shutdown mode

~~~

12.
+/*
+ * Check options for walsender itself and set flags accordingly.
+ *
+ * Currently only one option is accepted.
+ */
+static void
+CheckWalSndOptions(const StartReplicationCmd *cmd)
+{
+ if (cmd->shutdownmode)
+ ParseShutdownMode(cmd->shutdownmode);
+}
+
+/*
+ * Parse given shutdown mode.
+ *
+ * Currently two values are accepted - "wait_flush" and "immediate"
+ */
+static void
+ParseShutdownMode(char *shutdownmode)
+{
+ if (pg_strcasecmp(shutdownmode, "wait_flush") == 0)
+ shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+ else if (pg_strcasecmp(shutdownmode, "immediate") == 0)
+ shutdown_mode = WALSND_SHUTDOWN_MODE_IMMIDEATE;
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid value for shutdown mode: \"%s\"", shutdownmode),
+ errhint("Available values: wait_flush, immediate."));
+}

IMO the ParseShutdownMode function seems unnecessary because it's not
really "parsing" anything and it is only called in one place. I
suggest wrapping everything into the CheckWalSndOptions function. The
end result is still only a simple function:

SUGGESTION

static void
CheckWalSndOptions(const StartReplicationCmd *cmd)
{
if (cmd->shutdownmode)
{
char *mode = cmd->shutdownmode;

if (pg_strcasecmp(mode, "wait_flush") == 0)
shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
else if (pg_strcasecmp(mode, "immediate") == 0)
shutdown_mode = WALSND_SHUTDOWN_MODE_IMMEDIATE;

else
ereport(ERROR,
errcode(ERRCODE_SYNTAX_ERROR),
errmsg("invalid value for shutdown mode: \"%s\"", mode),
errhint("Available values: wait_flush, immediate."));
}
}
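
The same idea as a stand-alone sketch that compiles without the PostgreSQL headers, in case it helps to experiment with it. All names are hypothetical: equals_ignore_case stands in for pg_strcasecmp, and a false return value stands in for ereport(ERROR).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    SKETCH_MODE_WAIT_FLUSH,
    SKETCH_MODE_IMMEDIATE
} SketchShutdownMode;

/* ASCII case-insensitive string equality (stand-in for pg_strcasecmp). */
static bool
equals_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return false;
        a++;
        b++;
    }
    return *a == *b;            /* true only if both strings ended */
}

/*
 * value == NULL means the clause was omitted, so *mode is left untouched.
 * Returns false for an unrecognized value, where the real code would
 * raise an error.
 */
static bool
sketch_check_shutdown_mode(const char *value, SketchShutdownMode *mode)
{
    if (value == NULL)
        return true;
    if (equals_ignore_case(value, "wait_flush"))
        *mode = SKETCH_MODE_WAIT_FLUSH;
    else if (equals_ignore_case(value, "immediate"))
        *mode = SKETCH_MODE_IMMEDIATE;
    else
        return false;
    return true;
}
```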

======
src/include/replication/walreceiver.h

13.
@@ -170,6 +170,7 @@ typedef struct
  * false if physical stream.  */
  char    *slotname; /* Name of the replication slot or NULL. */
  XLogRecPtr startpoint; /* LSN of starting point. */
+ char    *shutdown_mode; /* Name of specified shutdown name */

union
{
~

13a.
Typo (shutdown name?)

SUGGESTION
/* The specified shutdown mode string, or NULL. */

~

13b.
Because they have the same member names I kept confusing this option
shutdown_mode with the other enum also called shutdown_mode.

I wonder if it is possible to call this one something like
'shutdown_mode_str' to make reading the code easier?

~

13c.
Is this member in the right place? AFAIK this is not even implemented
for physical replication. E.g., why isn't this new member part of the
'logical' sub-structure in the union?

======
src/test/subscription/t/001_rep_changes.pl

14.
-# Set min_apply_delay parameter to 3 seconds
+# Check restart on changing min_apply_delay to 3 seconds
 my $delay = 3;
 $node_subscriber->safe_psql('postgres',
  "ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub_renamed' AND state = 'streaming';"
+  )
+  or die
+  "Timed out while waiting for the walsender to restart after
changing min_apply_delay to non-zero value";

IIUC this test is for verifying that a new walsender worker was
started if the delay was changed from 0 to non-zero. E.g. I think it
is testing your new logic in maybe_reread_subscription.

Probably more complete testing also needs to check the other scenarios:
* min_apply_delay from one non-zero value to another non-zero value
--> verify a new worker is NOT started.
* change min_apply_delay from non-zero to zero --> verify a new worker
IS started

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#53Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Peter Smith (#52)
2 attachment(s)
RE: Exit walsender before confirming remote flush in logical replication

Dear Peter,

Thank you for reviewing! PSA new version.

======
Commit Message

1.
This commit extends START_REPLICATION to accept SHUTDOWN_MODE term.
Currently, it works well only for logical replication.

~

1a.
"to accept SHUTDOWN term" --> "to include a SHUTDOWN_MODE clause."

Fixed.

1b.
"it works well only for..." --> do you mean "it is currently
implemented only for..."

Fixed.

2.
When 'wait_flush', which is the default, is specified, the walsender will wait
for all the sent WALs to be flushed on the subscriber side, before exiting the
process. 'immediate' will exit without confirming the remote flush. This may
break the consistency between publisher and subscriber, but it may be useful
for a system that has a high-latency network to reduce the amount of time for
shutdown. This may be useful to shut down the publisher even when the
worker is stuck.

~

SUGGESTION
The shutdown modes are:

1) 'wait_flush' (the default). In this mode, the walsender will wait
for all the sent WALs to be flushed on the subscriber side, before
exiting the process.

2) 'immediate'. In this mode, the walsender will exit without
confirming the remote flush. This may break the consistency between
publisher and subscriber. This mode might be useful for a system that
has a high-latency network (to reduce the amount of time for
shutdown), or to allow the shutdown of the publisher even when the
worker is stuck.

======
doc/src/sgml/protocol.sgml

3.
+       <varlistentry>
+        <term><literal>SHUTDOWN_MODE { 'wait_flush' | 'immediate'
}</literal></term>
+        <listitem>
+         <para>
+          Decides the behavior of the walsender process at shutdown. If the
+          shutdown mode is <literal>'wait_flush'</literal>, which is the
+          default, the walsender waits for all the sent WALs to be flushed
+          on the subscriber side. If it is <literal>'immediate'</literal>,
+          the walsender exits without confirming the remote flush.
+         </para>
+        </listitem>
+       </varlistentry>

The synopsis said:
[ SHUTDOWN_MODE shutdown_mode ]

But then the 'shutdown_mode' term was never mentioned again (??).
Instead it says:
SHUTDOWN_MODE { 'wait_flush' | 'immediate' }

IMO the detailed explanation should not say SHUTDOWN_MODE again. It
should be written more like this:

SUGGESTION
shutdown_mode

Determines the behavior of the walsender process at shutdown. If
shutdown_mode is 'wait_flush', the walsender waits for all the sent
WALs to be flushed on the subscriber side. This is the default when
SHUTDOWN_MODE is not specified.

If shutdown_mode is 'immediate', the walsender exits without
confirming the remote flush.

Fixed.

.../libpqwalreceiver/libpqwalreceiver.c

4.
+ /* Add SHUTDOWN_MODE option if needed */
+ if (options->shutdown_mode &&
+ PQserverVersion(conn->streamConn) >= 160000)
+ appendStringInfo(&cmd, " SHUTDOWN_MODE '%s'",
+ options->shutdown_mode);

Maybe you can expand on the meaning of "if needed".

SUGGESTION
Add SHUTDOWN_MODE clause if needed (i.e. if not using the default
shutdown_mode)

Fixed, but not exactly the same as your suggestion.
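
For readers following along, the gating logic in comment 4 can be sketched outside the PostgreSQL tree as below. This is only an illustration under assumptions: append_shutdown_mode() is a hypothetical helper, a plain char buffer stands in for StringInfo, and the position of the clause within the real START_REPLICATION command is not taken from the patch. The 160000 version threshold and the clause text come from the quoted hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): append the optional
 * SHUTDOWN_MODE clause to the NUL-terminated command already in buf.
 *
 * Nothing is appended when shutdown_mode is NULL (the default mode is
 * wanted) or when the server is older than the version that understands
 * the clause. The value is not escaped here; that is an assumption made
 * for brevity.
 *
 * Returns 0 on success and -1 if the clause does not fit, in which case
 * buf is left unchanged.
 */
static int
append_shutdown_mode(char *buf, size_t bufsize,
                     const char *shutdown_mode, int server_version)
{
    size_t      len;
    int         n;

    assert(buf != NULL);
    len = strlen(buf);
    assert(len < bufsize);

    /* Default mode or an older server: send nothing extra. */
    if (shutdown_mode == NULL || server_version < 160000)
        return 0;

    n = snprintf(buf + len, bufsize - len,
                 " SHUTDOWN_MODE '%s'", shutdown_mode);
    if (n < 0 || (size_t) n >= bufsize - len)
    {
        /* Truncated: restore the original command text. */
        buf[len] = '\0';
        return -1;
    }
    return 0;
}
```

With a buffer holding a START_REPLICATION command, calling append_shutdown_mode(cmd, sizeof(cmd), "immediate", 160000) appends the clause, while passing NULL or an older server version leaves the command untouched.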

src/backend/replication/logical/worker.c

5. maybe_reread_subscription

+ *
+ * minapplydelay affects SHUTDOWN_MODE option. 'immediate' shutdown
mode
+ * will be specified if it is set to non-zero, otherwise default mode will
+ * be set.

Reworded this comment slightly and added a reference to ApplyWorkerMain.

SUGGESTION
Time-delayed logical replication affects the SHUTDOWN_MODE clause. The
'immediate' shutdown mode will be specified if min_apply_delay is
non-zero, otherwise the default shutdown mode will be used. See
ApplyWorkerMain.

Fixed.

6. ApplyWorkerMain
+ /*
+ * time-delayed logical replication does not support tablesync
+ * workers, so only the leader apply worker can request walsenders to
+ * exit before confirming remote flush.
+ */

"time-delayed" --> "Time-delayed"

Fixed.

src/backend/replication/repl_gram.y

7.
@@ -91,6 +92,7 @@ Node *replication_parse_result;
%type <boolval> opt_temporary
%type <list> create_slot_options create_slot_legacy_opt_list
%type <defelt> create_slot_legacy_opt
+%type <str> opt_shutdown_mode

The tab alignment seemed not quite right. Not 100% sure.

Fixed accordingly.

8.
@@ -270,20 +272,22 @@ start_replication:
cmd->slotname = $2;
cmd->startpoint = $4;
cmd->timeline = $5;
+ cmd->shutdownmode = NULL;
$$ = (Node *) cmd;
}

It seemed a bit inconsistent. E.g. the cmd->options member was not set
for physical replication, so why then set this member?

Alternatively, maybe should set cmd->options = NULL here as well?

Removed. I checked the makeNode() macro and found that palloc0fast() is called there.
This means that we do not have to initialize unused attributes.
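
The point about zero-filled allocation can be illustrated with a small standalone sketch. It is not PostgreSQL code: DemoStartReplicationCmd and demo_make_cmd() are hypothetical names, and calloc() stands in for the zeroing allocator mentioned above. It also assumes that an all-zero-bits pointer is NULL, which holds on common platforms but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for StartReplicationCmd; fields are illustrative. */
typedef struct DemoStartReplicationCmd
{
    int         kind;
    char       *slotname;
    char       *shutdownmode;
} DemoStartReplicationCmd;

/*
 * Stand-in for makeNode(): calloc() hands back zero-filled memory, so
 * every member starts out as 0 / NULL without explicit assignments.
 */
static DemoStartReplicationCmd *
demo_make_cmd(void)
{
    DemoStartReplicationCmd *cmd = calloc(1, sizeof(*cmd));

    assert(cmd != NULL);
    return cmd;
}
```

A caller that never assigns shutdownmode can therefore still test it with "if (cmd->shutdownmode)", which is why the explicit "cmd->shutdownmode = NULL" could be dropped.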

src/backend/replication/walsender.c

9.
+/* Indicator for specifying the shutdown mode */
+typedef enum
+{
+ WALSND_SHUTDOWN_MODE_WAIT_FLUSH = 0,
+ WALSND_SHUTDOWN_MODE_IMMIDEATE
+} WalSndShutdownMode;

~

9a.
"Indicator for specifying" (??). How about just saying: "Shutdown modes"

Fixed.

9b.
Typo: WALSND_SHUTDOWN_MODE_IMMIDEATE ==>
WALSND_SHUTDOWN_MODE_IMMEDIATE

Replaced.

9c.
AFAICT the fact that the first enum value is assigned 0 is not really
of importance. If that is correct, then IMO maybe it's better to
remove the "= 0" because the explicit assignment made me expect that
it had special meaning, and then it was confusing when I could not
find a reason.

Removed. This was added to skip the initialization in a previous version,
but it is no longer needed.

10. ProcessPendingWrites

+ /*
+ * In this function, there is a possibility that the walsender is
+ * stuck. It is caused when the opposite worker is stuck and then the
+ * send-buffer of the walsender becomes full. Therefore, we must add
+ * an additional path for shutdown for immediate shutdown mode.
+ */
+ if (shutdown_mode == WALSND_SHUTDOWN_MODE_IMMIDEATE &&
+ got_STOPPING)
+ WalSndDone(XLogSendLogical);

10a.
Can this comment say something like "receiving worker" instead of
"opposite worker"?

SUGGESTION
This can happen when the receiving worker is stuck, and then the
send-buffer of the walsender...

Changed.

10b.
IMO it makes more sense to check this around the other way. E.g. we
don't care what is the shutdown_mode value unless got_STOPPING is
true.

SUGGESTION
if (got_STOPPING && (shutdown_mode ==
WALSND_SHUTDOWN_MODE_IMMEDIATE))

Changed.

11. WalSndDone

+ * If we are in the immediate shutdown mode, flush location and output
+ * buffer is not checked. This may break the consistency between nodes,
+ * but it may be useful for the system that has high-latency network to
+ * reduce the amount of time for shutdown.

Add some quotes for the mode.

SUGGESTION
'immediate' shutdown mode

Changed.
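
As a rough summary of comments 10 and 11, the exit decision for the two modes might be modelled as below. This is a simplified sketch, not the walsender code: demo_walsender_can_exit() and its boolean parameters are hypothetical, and the real WalSndDone checks more state than a single flag.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the shutdown decision discussed above.
 *
 * got_stopping:            a shutdown has been requested
 * immediate_mode:          'immediate' shutdown mode was selected
 * remote_flush_confirmed:  everything sent is known to be flushed remotely
 */
static bool
demo_walsender_can_exit(bool got_stopping, bool immediate_mode,
                        bool remote_flush_confirmed)
{
    /* The mode is irrelevant until a shutdown has been requested. */
    if (!got_stopping)
        return false;

    /* 'immediate': exit without confirming the remote flush. */
    if (immediate_mode)
        return true;

    /* 'wait_flush': keep running until the remote flush is confirmed. */
    return remote_flush_confirmed;
}
```

Testing got_stopping first mirrors the ordering suggested in 10b.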

12.
+/*
+ * Check options for walsender itself and set flags accordingly.
+ *
+ * Currently only one option is accepted.
+ */
+static void
+CheckWalSndOptions(const StartReplicationCmd *cmd)
+{
+ if (cmd->shutdownmode)
+ ParseShutdownMode(cmd->shutdownmode);
+}
+
+/*
+ * Parse given shutdown mode.
+ *
+ * Currently two values are accepted - "wait_flush" and "immediate"
+ */
+static void
+ParseShutdownMode(char *shutdownmode)
+{
+ if (pg_strcasecmp(shutdownmode, "wait_flush") == 0)
+ shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+ else if (pg_strcasecmp(shutdownmode, "immediate") == 0)
+ shutdown_mode = WALSND_SHUTDOWN_MODE_IMMIDEATE;
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid value for shutdown mode: \"%s\"", shutdownmode),
+ errhint("Available values: wait_flush, immediate."));
+}

IMO the ParseShutdownMode function seems unnecessary because it's not
really "parsing" anything and it is only called in one place. I
suggest wrapping everything into the CheckWalSndOptions function. The
end result is still only a simple function:

SUGGESTION

static void
CheckWalSndOptions(const StartReplicationCmd *cmd)
{
if (cmd->shutdownmode)
{
char *mode = cmd->shutdownmode;

if (pg_strcasecmp(mode, "wait_flush") == 0)
shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
else if (pg_strcasecmp(mode, "immediate") == 0)
shutdown_mode = WALSND_SHUTDOWN_MODE_IMMEDIATE;

else
ereport(ERROR,
errcode(ERRCODE_SYNTAX_ERROR),
errmsg("invalid value for shutdown mode: \"%s\"", mode),
errhint("Available values: wait_flush, immediate."));
}
}

Removed.
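
For anyone who wants to try the merged check outside the server, here is a minimal standalone sketch. It is an approximation under assumptions: POSIX strcasecmp() replaces pg_strcasecmp(), a return code replaces ereport(ERROR), and the Demo names are hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical counterpart of the WalSndShutdownMode enum. */
typedef enum
{
    DEMO_SHUTDOWN_MODE_WAIT_FLUSH,
    DEMO_SHUTDOWN_MODE_IMMEDIATE
} DemoShutdownMode;

/*
 * Map a shutdown mode string to the enum, ignoring case.
 *
 * A NULL string means the clause was not given, so the default
 * (wait_flush) is used. Returns 0 on success and -1 for an unknown
 * value, where the server code would raise an error instead.
 */
static int
demo_check_shutdown_mode(const char *mode, DemoShutdownMode *result)
{
    assert(result != NULL);

    if (mode == NULL || strcasecmp(mode, "wait_flush") == 0)
        *result = DEMO_SHUTDOWN_MODE_WAIT_FLUSH;
    else if (strcasecmp(mode, "immediate") == 0)
        *result = DEMO_SHUTDOWN_MODE_IMMEDIATE;
    else
        return -1;

    return 0;
}
```

For example, "IMMEDIATE" and "immediate" both map to the immediate mode, while a string such as "fast" is rejected.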

======
src/include/replication/walreceiver.h

13.
@@ -170,6 +170,7 @@ typedef struct
* false if physical stream.  */
char    *slotname; /* Name of the replication slot or NULL. */
XLogRecPtr startpoint; /* LSN of starting point. */
+ char    *shutdown_mode; /* Name of specified shutdown name */

union
{
~

13a.
Typo (shutdown name?)

SUGGESTION
/* The specified shutdown mode string, or NULL. */

Fixed.

13b.
Because they have the same member names I kept confusing this option
shutdown_mode with the other enum also called shutdown_mode.

I wonder if it is possible to call this one something like
'shutdown_mode_str' to make reading the code easier?

Changed.

13c.
Is this member in the right place? AFAIK this is not even implemented
for physical replication. e.g. Why isn't this new member part of the
'logical' sub-structure in the union?

I had kept it there for future extensibility, but that seems unnecessary. Moved.

======
src/test/subscription/t/001_rep_changes.pl

14.
-# Set min_apply_delay parameter to 3 seconds
+# Check restart on changing min_apply_delay to 3 seconds
my $delay = 3;
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay =
'${delay}s')");
+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub_renamed' AND state = 'streaming';"
+  )
+  or die
+  "Timed out while waiting for the walsender to restart after
changing min_apply_delay to non-zero value";

IIUC this test is for verifying that a new walsender worker was
started if the delay was changed from 0 to non-zero. E.g. I think it
is testing your new logic in maybe_reread_subscription.

Probably more complete testing also needs to check the other scenarios:
* min_apply_delay from one non-zero value to another non-zero value
--> verify a new worker is NOT started.
* change min_apply_delay from non-zero to zero --> verify a new worker
IS started

Hmm. These tests would not improve the coverage, so I am not sure whether we should add them.
Moreover, IIUC we do not have a good way to verify that the worker has not restarted.
Even if the old pid still remains in pg_stat_replication, the walsender may
exit after that check. So for now I added only the case that changes
min_apply_delay from non-zero to zero.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v7-0001-Time-delayed-logical-replication-subscriber.patch (application/octet-stream)
From d3aaf66719dd0b843fd1f457c21b5de07eb284f6 Mon Sep 17 00:00:00 2001
From: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Date: Fri, 10 Feb 2023 10:43:26 +0000
Subject: [PATCH v7 1/2] Time-delayed logical replication subscriber

Similar to physical replication, a time-delayed copy of the data for
logical replication is useful for some scenarios (particularly to fix
errors that might cause data loss).

This patch implements a new subscription parameter called 'min_apply_delay'.

If the subscription sets min_apply_delay parameter, the logical
replication worker will delay the transaction apply for min_apply_delay
milliseconds.

The delay is calculated between the WAL time stamp and the current time
on the subscriber.

The delay occurs before we start to apply the transaction on the
subscriber. The main reason is to avoid keeping a transaction open for
a long time. Regular and prepared transactions are covered. Streamed
transactions are also covered.

The combination of parallel streaming mode and min_apply_delay is not
allowed. This is because in parallel streaming mode, we start applying
the transaction stream as soon as the first change arrives without
knowing the transaction's prepare/commit time. This means we cannot
calculate the underlying network/decoding lag between publisher and
subscriber, and so always waiting for the full 'min_apply_delay' period
might include unnecessary delay.

The other possibility was to apply the delay at the end of the parallel
apply transaction but that would cause issues related to resource
bloat and locks being held for a long time.

Note that this feature doesn't interact with skip transaction feature.
The skip transaction feature applies to one transaction with a specific LSN.
So, even if the skipped transaction and non-skipped transaction come
consecutively in a very short time, regardless of the order of which comes
first, the time-delayed feature gets balanced by delayed application
for other transactions before and after the skipped transaction.

Author: Euler Taveira, Takamichi Osumi, Kuroda Hayato
Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Shveta Malik,
             Kyotaro Horiguchi, Shi Yu, Wang Wei, Dilip Kumar, Melih Mutlu
Discussion: https://postgr.es/m/CAB-JLwYOYwL=XTyAXKiH5CtM_Vm8KjKh7aaitCKvmCh4rzr5pQ@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                    |   9 +
 doc/src/sgml/config.sgml                      |  12 ++
 doc/src/sgml/glossary.sgml                    |  15 ++
 doc/src/sgml/logical-replication.sgml         |   6 +
 doc/src/sgml/ref/alter_subscription.sgml      |   5 +-
 doc/src/sgml/ref/create_subscription.sgml     |  49 ++++-
 src/backend/catalog/pg_subscription.c         |   1 +
 src/backend/catalog/system_views.sql          |   7 +-
 src/backend/commands/subscriptioncmds.c       | 122 +++++++++++-
 .../replication/logical/applyparallelworker.c |   3 +-
 src/backend/replication/logical/worker.c      | 188 ++++++++++++++++--
 src/backend/utils/activity/wait_event.c       |   3 +
 src/bin/pg_dump/pg_dump.c                     |  15 +-
 src/bin/pg_dump/pg_dump.h                     |   1 +
 src/bin/psql/describe.c                       |   9 +-
 src/bin/psql/tab-complete.c                   |   4 +-
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/replication/worker_internal.h     |   2 +-
 src/include/utils/wait_event.h                |   3 +-
 src/test/regress/expected/subscription.out    | 181 ++++++++++-------
 src/test/regress/sql/subscription.sql         |  24 +++
 src/test/subscription/t/001_rep_changes.pl    |  28 +++
 22 files changed, 584 insertions(+), 106 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048054..5dc5ca1133 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7873,6 +7873,15 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
       </para></entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subminapplydelay</structfield> <type>int4</type>
+      </para>
+      <para>
+       The minimum delay, in milliseconds, for applying changes
+      </para></entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subname</structfield> <type>name</type>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8c56b134a8..21b45c68e2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4787,6 +4787,18 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
        the <filename>postgresql.conf</filename> file or on the server
        command line.
       </para>
+      <para>
+       For time-delayed logical replication, the apply worker sends a feedback
+       message to the publisher every
+       <varname>wal_receiver_status_interval</varname> milliseconds. Make sure
+       to set <varname>wal_receiver_status_interval</varname> less than the
+       <varname>wal_sender_timeout</varname> on the publisher, otherwise, the
+       <literal>walsender</literal> will repeatedly terminate due to timeout
+       errors. Note that if <varname>wal_receiver_status_interval</varname> is
+       set to zero, the apply worker sends no feedback messages during the
+       <literal>min_apply_delay</literal> period. Refer to
+       <xref linkend="sql-createsubscription"/> for more information.
+      </para>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 7c01a541fe..9ede9d05f6 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1729,6 +1729,21 @@
    </glossdef>
   </glossentry>
 
+  <glossentry id="glossary-time-delayed-replication">
+   <glossterm>Time-delayed replication</glossterm>
+   <glossdef>
+    <para>
+     Replication setup that delays the application of changes by a specified
+     minimum time-delay period.
+    </para>
+    <para>
+     For more information, see
+     <xref linkend="guc-recovery-min-apply-delay"/> for physical replication
+     and <xref linkend="sql-createsubscription"/> for logical replication.
+    </para>
+   </glossdef>
+  </glossentry>
+
   <glossentry id="glossary-toast">
    <glossterm>TOAST</glossterm>
    <glossdef>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 1bd5660c87..6bd5f61e2b 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -247,6 +247,12 @@
    target table.
   </para>
 
+  <para>
+   A subscription can delay the application of changes by specifying the
+   <literal>min_apply_delay</literal> subscription parameter. See
+   <xref linkend="sql-createsubscription"/> for details.
+  </para>
+
   <sect2 id="logical-replication-subscription-slot">
    <title>Replication Slot Management</title>
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 964fcbb8ff..8b7eb28e54 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -213,8 +213,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
       <literal>binary</literal>, <literal>streaming</literal>,
-      <literal>disable_on_error</literal>, and
-      <literal>origin</literal>.
+      <literal>disable_on_error</literal>,
+      <literal>origin</literal>, and
+      <literal>min_apply_delay</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 51c45f17c7..1b4b8390af 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -349,7 +349,49 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
-      </variablelist></para>
+
+       <varlistentry>
+        <term><literal>min_apply_delay</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          By default, the subscriber applies changes as soon as possible. This
+          parameter allows the user to delay the application of changes by a
+          given time period. If the value is specified without units, it is
+          taken as milliseconds. The default is zero (no delay). See
+          <xref linkend="config-setting-names-values"/> for details on the
+          available valid time units.
+         </para>
+         <para>
+          Any delay becomes effective only after all initial table
+          synchronization has finished and occurs before each transaction starts
+          to get applied on the subscriber. The delay is calculated as the
+          difference between the WAL timestamp as written on the publisher and
+          the current time on the subscriber. Any overhead of time spent in
+          logical decoding and in transferring the transaction may reduce the
+          actual wait time. It is also possible that the overhead already
+          exceeds the requested <literal>min_apply_delay</literal> value, in
+          which case no delay is applied. If the system clocks on publisher and
+          subscriber are not synchronized, this may lead to apply changes
+          earlier than expected, but this is not a major issue because this
+          parameter is typically much larger than the time deviations between
+          servers. Note that if this parameter is set to a long delay, the
+          replication will stop if the replication slot falls behind the current
+          LSN by more than
+          <link linkend="guc-max-slot-wal-keep-size"><literal>max_slot_wal_keep_size</literal></link>.
+         </para>
+         <warning>
+           <para>
+            Delaying the replication means there is a much longer time between
+            making a change on the publisher, and that change being committed
+            on the subscriber. This can impact the performance of synchronous
+            replication. See <xref linkend="guc-synchronous-commit"/>
+            parameter.
+           </para>
+         </warning>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
 
     </listitem>
    </varlistentry>
@@ -420,6 +462,11 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
    published with different column lists are not supported.
   </para>
 
+  <para>
+   A non-zero <literal>min_apply_delay</literal> parameter is not allowed when
+   streaming in parallel mode.
+  </para>
+
   <para>
    We allow non-existent publications to be specified so that users can add
    those later. This means
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index a56ae311c3..e19e5cbca2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -66,6 +66,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->skiplsn = subform->subskiplsn;
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
+	sub->minapplydelay = subform->subminapplydelay;
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..317c2010cb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1299,9 +1299,10 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled,
-              subbinary, substream, subtwophasestate, subdisableonerr,
-              subslotname, subsynccommit, subpublications, suborigin)
+GRANT SELECT (oid, subdbid, subskiplsn, subminapplydelay, subname, subowner,
+              subenabled, subbinary, substream, subtwophasestate,
+              subdisableonerr, subslotname, subsynccommit, subpublications,
+              suborigin)
     ON pg_subscription TO public;
 
 CREATE VIEW pg_stat_subscription_stats AS
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 464db6d247..82e16fd0f9 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -66,6 +66,7 @@
 #define SUBOPT_DISABLE_ON_ERR		0x00000400
 #define SUBOPT_LSN					0x00000800
 #define SUBOPT_ORIGIN				0x00001000
+#define SUBOPT_MIN_APPLY_DELAY		0x00002000
 
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
@@ -90,6 +91,7 @@ typedef struct SubOpts
 	bool		disableonerr;
 	char	   *origin;
 	XLogRecPtr	lsn;
+	int32		min_apply_delay;
 } SubOpts;
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
@@ -100,7 +102,7 @@ static void check_publications_origin(WalReceiverConn *wrconn,
 static void check_duplicates_in_publist(List *publist, Datum *datums);
 static List *merge_publications(List *oldpublist, List *newpublist, bool addpub, const char *subname);
 static void ReportSlotConnectionError(List *rstates, Oid subid, char *slotname, char *err);
-
+static int32 defGetMinApplyDelay(DefElem *def);
 
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
@@ -146,6 +148,8 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 		opts->disableonerr = false;
 	if (IsSet(supported_opts, SUBOPT_ORIGIN))
 		opts->origin = pstrdup(LOGICALREP_ORIGIN_ANY);
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY))
+		opts->min_apply_delay = 0;
 
 	/* Parse options */
 	foreach(lc, stmt_options)
@@ -324,6 +328,15 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 			opts->specified_opts |= SUBOPT_LSN;
 			opts->lsn = lsn;
 		}
+		else if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+				 strcmp(defel->defname, "min_apply_delay") == 0)
+		{
+			if (IsSet(opts->specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				errorConflictingDefElem(defel, pstate);
+
+			opts->specified_opts |= SUBOPT_MIN_APPLY_DELAY;
+			opts->min_apply_delay = defGetMinApplyDelay(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -404,6 +417,32 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 								"slot_name = NONE", "create_slot = false")));
 		}
 	}
+
+	/*
+	 * The combination of parallel streaming mode and min_apply_delay is not
+	 * allowed. This is because in parallel streaming mode, we start applying
+	 * the transaction stream as soon as the first change arrives without
+	 * knowing the transaction's prepare/commit time. This means we cannot
+	 * calculate the underlying network/decoding lag between publisher and
+	 * subscriber, and so always waiting for the full 'min_apply_delay' period
+	 * might include unnecessary delay.
+	 *
+	 * The other possibility was to apply the delay at the end of the parallel
+	 * apply transaction but that would cause issues related to resource bloat
+	 * and locks being held for a long time.
+	 */
+	if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
+		opts->min_apply_delay > 0 &&
+		opts->streaming == LOGICALREP_STREAM_PARALLEL)
+		ereport(ERROR,
+				errcode(ERRCODE_SYNTAX_ERROR),
+
+		/*
+		 * translator: the first %s is a string of the form "parameter > 0"
+		 * and the second one is "option = value".
+		 */
+				errmsg("%s and %s are mutually exclusive options",
+					   "min_apply_delay > 0", "streaming = parallel"));
 }
 
 /*
@@ -560,7 +599,8 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 					  SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 					  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 					  SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT |
-					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN);
+					  SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN |
+					  SUBOPT_MIN_APPLY_DELAY);
 	parse_subscription_options(pstate, stmt->options, supported_opts, &opts);
 
 	/*
@@ -625,6 +665,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 	values[Anum_pg_subscription_oid - 1] = ObjectIdGetDatum(subid);
 	values[Anum_pg_subscription_subdbid - 1] = ObjectIdGetDatum(MyDatabaseId);
 	values[Anum_pg_subscription_subskiplsn - 1] = LSNGetDatum(InvalidXLogRecPtr);
+	values[Anum_pg_subscription_subminapplydelay - 1] = Int32GetDatum(opts.min_apply_delay);
 	values[Anum_pg_subscription_subname - 1] =
 		DirectFunctionCall1(namein, CStringGetDatum(stmt->subname));
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
@@ -1054,7 +1095,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 				supported_opts = (SUBOPT_SLOT_NAME |
 								  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 								  SUBOPT_STREAMING | SUBOPT_DISABLE_ON_ERR |
-								  SUBOPT_ORIGIN);
+								  SUBOPT_ORIGIN | SUBOPT_MIN_APPLY_DELAY);
 
 				parse_subscription_options(pstate, stmt->options,
 										   supported_opts, &opts);
@@ -1098,6 +1139,19 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.streaming == LOGICALREP_STREAM_PARALLEL &&
+						!IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)
+						&& sub->minapplydelay > 0)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set parallel streaming mode for subscription with %s",
+									   "min_apply_delay"));
+
 					values[Anum_pg_subscription_substream - 1] =
 						CharGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -1111,6 +1165,26 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 						= true;
 				}
 
+				if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY))
+				{
+					/*
+					 * The combination of parallel streaming mode and
+					 * min_apply_delay is not allowed. See
+					 * parse_subscription_options.
+					 */
+					if (opts.min_apply_delay > 0 &&
+						!IsSet(opts.specified_opts, SUBOPT_STREAMING)
+						&& sub->stream == LOGICALREP_STREAM_PARALLEL)
+						ereport(ERROR,
+								errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								errmsg("cannot set %s for subscription in parallel streaming mode",
+									   "min_apply_delay"));
+
+					values[Anum_pg_subscription_subminapplydelay - 1] =
+						Int32GetDatum(opts.min_apply_delay);
+					replaces[Anum_pg_subscription_subminapplydelay - 1] = true;
+				}
+
 				if (IsSet(opts.specified_opts, SUBOPT_ORIGIN))
 				{
 					values[Anum_pg_subscription_suborigin - 1] =
@@ -2195,3 +2269,45 @@ defGetStreamingMode(DefElem *def)
 					def->defname)));
 	return LOGICALREP_STREAM_OFF;	/* keep compiler quiet */
 }
+
+/*
+ * Extract the min_apply_delay value from a DefElem. This is very similar to
+ * parse_and_validate_value() for integer values, because min_apply_delay
+ * accepts the same parameter format as recovery_min_apply_delay.
+ */
+static int32
+defGetMinApplyDelay(DefElem *def)
+{
+	char	   *input_string;
+	int			result;
+	const char *hintmsg;
+
+	input_string = defGetString(def);
+
+	/*
+	 * Parse given string as parameter which has millisecond unit
+	 */
+	if (!parse_int(input_string, &result, GUC_UNIT_MS, &hintmsg))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid value for parameter \"%s\": \"%s\"",
+						"min_apply_delay", input_string),
+				 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+
+	/*
+	 * Check the lower bound of the valid min_apply_delay range, and also the
+	 * upper bound as a safeguard for platforms where int is wider than
+	 * int32. Although parse_int() has confirmed that the result is less than
+	 * or equal to INT_MAX, the value will be stored in an int32 catalog
+	 * column.
+	 */
+	if (result < 0 || result > PG_INT32_MAX)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)",
+						result,
+						"min_apply_delay",
+						0, PG_INT32_MAX)));
+
+	return result;
+}
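
For readers following the range check in defGetMinApplyDelay() above, here is a minimal standalone sketch of the same bounds test. It is not part of the patch: the helper name min_apply_delay_in_range() is hypothetical, and it takes an already-parsed millisecond value instead of calling parse_int().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: accept a delay in
 * milliseconds only if it is non-negative and fits in the int32 catalog
 * column. On success the value is stored in *out.
 */
static bool
min_apply_delay_in_range(long long ms, int32_t *out)
{
	if (ms < 0 || ms > INT32_MAX)
		return false;

	*out = (int32_t) ms;
	return true;
}
```

The parameter is deliberately wider than int32 so that the upper bound check is meaningful, which mirrors the safeguard described in the comment above.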
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index da437e0bc3..32db20fd98 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -704,7 +704,8 @@ pa_process_spooled_messages_if_required(void)
 	{
 		apply_spooled_messages(&MyParallelShared->fileset,
 							   MyParallelShared->xid,
-							   InvalidXLogRecPtr);
+							   InvalidXLogRecPtr,
+							   0);
 		pa_set_fileset_state(MyParallelShared, FS_EMPTY);
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfb2ab6248..6b86723b60 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -319,6 +319,20 @@ static List *on_commit_wakeup_workers_subids = NIL;
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
+/*
+ * To avoid a walsender timeout during time-delayed logical replication, the
+ * apply worker keeps sending feedback messages during the delay period.
+ * Meanwhile, the feature delays the apply before the start of the
+ * transaction, and thus we don't write WAL records for the suspended changes
+ * during the wait. When the apply worker sends a feedback message during the
+ * delay, we should not overwrite the flushed and applied LSN positions with
+ * the last received LSN. See send_feedback() for details.
+ */
+static XLogRecPtr last_received = InvalidXLogRecPtr;
+
+/* The last time we sent a feedback message */
+static TimestampTz send_time = 0;
+
 /* fields valid only when processing streamed transaction */
 static bool in_streamed_transaction = false;
 
@@ -389,7 +403,8 @@ static void stream_write_change(char action, StringInfo s);
 static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
 static void stream_close_file(void);
 
-static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
+static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply,
+						  bool has_unprocessed_change);
 
 static void DisableSubscriptionAndExit(void);
 
@@ -999,6 +1014,128 @@ slot_modify_data(TupleTableSlot *slot, TupleTableSlot *srcslot,
 	ExecStoreVirtualTuple(slot);
 }
 
+/*
+ * When min_apply_delay parameter is set on the subscriber, we wait long enough
+ * to make sure a transaction is applied at least that period behind the
+ * publisher.
+ *
+ * While physical replication applies the delay at commit time, this
+ * feature applies the delay to the next transaction, before starting that
+ * transaction. This is mainly because keeping a transaction that conducted
+ * write operations open for a long time results in issues such as bloat
+ * and long-held locks.
+ *
+ * The min_apply_delay parameter will take effect only after all tables are in
+ * READY state.
+ *
+ * xid is the transaction id where we apply the delay.
+ *
+ * finish_ts is the commit/prepare time of both regular (non-streamed) and
+ * streamed transactions. Unlike the regular (non-streamed) cases, the delay
+ * is applied in a STREAM COMMIT/STREAM PREPARE message for streamed
+ * transactions. The STREAM START message does not contain a commit/prepare
+ * time (it will be available when the in-progress transaction finishes).
+ * Hence, it's not appropriate to apply a delay at the STREAM START time.
+ */
+static void
+maybe_apply_delay(TransactionId xid, TimestampTz finish_ts)
+{
+	long		status_interval_ms = 0;
+
+	Assert(finish_ts > 0);
+
+	/* Nothing to do if no delay set */
+	if (!MySubscription->minapplydelay)
+		return;
+
+	/*
+	 * The min_apply_delay parameter is ignored until all tablesync workers
+	 * have reached READY state. This is because if we allowed the delay
+	 * during the catchup phase, then once we reached the limit of tablesync
+	 * workers it would impose a delay for each subsequent worker. That would
+	 * cause initial table synchronization completion to take a long time.
+	 */
+	if (!AllTablesyncsReady())
+		return;
+
+	/* Apply the delay by the latch mechanism */
+	do
+	{
+		TimestampTz delayUntil;
+		long		diffms;
+
+		ResetLatch(MyLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* This might change wal_receiver_status_interval */
+		if (ConfigReloadPending)
+		{
+			ConfigReloadPending = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		/*
+		 * Before calculating the time duration, reload the catalog if needed.
+		 */
+		if (!in_remote_transaction && !in_streamed_transaction)
+		{
+			AcceptInvalidationMessages();
+			maybe_reread_subscription();
+		}
+
+		delayUntil = TimestampTzPlusMilliseconds(finish_ts, MySubscription->minapplydelay);
+		diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
+
+		/*
+		 * Exit without arming the latch if it's already past time to apply
+		 * this transaction.
+		 */
+		if (diffms <= 0)
+			break;
+
+		elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay = %d ms, remaining wait time: %ld ms",
+			 xid, MySubscription->minapplydelay, diffms);
+
+		/*
+		 * Call send_feedback() to prevent the publisher from exiting by
+		 * timeout during the delay, when the status interval is greater than
+		 * zero.
+		 */
+		if (!status_interval_ms)
+		{
+			TimestampTz nextFeedback;
+
+			/*
+			 * Based on the last time we sent a feedback message, adjust
+			 * the first delay time for this transaction. This ensures that
+			 * the first feedback message honors
+			 * wal_receiver_status_interval.
+			 */
+			nextFeedback = TimestampTzPlusMilliseconds(send_time,
+													   wal_receiver_status_interval * 1000L);
+			status_interval_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), nextFeedback);
+		}
+		else
+			status_interval_ms = wal_receiver_status_interval * 1000L;
+
+		if (status_interval_ms > 0 && diffms > status_interval_ms)
+		{
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  status_interval_ms,
+					  WAIT_EVENT_LOGICAL_APPLY_DELAY);
+			send_feedback(last_received, true, false, true);
+		}
+		else
+			WaitLatch(MyLatch,
+					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+					  diffms,
+					  WAIT_EVENT_LOGICAL_APPLY_DELAY);
+
+	} while (true);
+}
+
 /*
  * Handle BEGIN message.
  */
@@ -1013,6 +1150,9 @@ apply_handle_begin(StringInfo s)
 	logicalrep_read_begin(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
 
+	/* Should we delay the current transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.committime);
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	maybe_start_skipping_changes(begin_data.final_lsn);
@@ -1070,6 +1210,9 @@ apply_handle_begin_prepare(StringInfo s)
 	logicalrep_read_begin_prepare(s, &begin_data);
 	set_apply_error_context_xact(begin_data.xid, begin_data.prepare_lsn);
 
+	/* Should we delay the current prepared transaction? */
+	maybe_apply_delay(begin_data.xid, begin_data.prepare_time);
+
 	remote_final_lsn = begin_data.prepare_lsn;
 
 	maybe_start_skipping_changes(begin_data.prepare_lsn);
@@ -1317,7 +1460,8 @@ apply_handle_stream_prepare(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset,
-								   prepare_data.xid, prepare_data.prepare_lsn);
+								   prepare_data.xid, prepare_data.prepare_lsn,
+								   prepare_data.prepare_time);
 
 			/* Mark the transaction as prepared. */
 			apply_handle_prepare_internal(&prepare_data);
@@ -2011,10 +2155,13 @@ ensure_last_message(FileSet *stream_fileset, TransactionId xid, int fileno,
 
 /*
  * Common spoolfile processing.
+ *
+ * The commit/prepare time (finish_ts) is required for time-delayed logical
+ * replication.
  */
 void
 apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-					   XLogRecPtr lsn)
+					   XLogRecPtr lsn, TimestampTz finish_ts)
 {
 	StringInfoData s2;
 	int			nchanges;
@@ -2025,6 +2172,10 @@ apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
 	int			fileno;
 	off_t		offset;
 
+	/* Should we delay the current transaction? */
+	if (finish_ts)
+		maybe_apply_delay(xid, finish_ts);
+
 	if (!am_parallel_apply_worker())
 		maybe_start_skipping_changes(lsn);
 
@@ -2174,7 +2325,7 @@ apply_handle_stream_commit(StringInfo s)
 			 * spooled operations.
 			 */
 			apply_spooled_messages(MyLogicalRepWorker->stream_fileset, xid,
-								   commit_data.commit_lsn);
+								   commit_data.commit_lsn, commit_data.committime);
 
 			apply_handle_commit_internal(&commit_data);
 
@@ -3447,7 +3598,7 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
  * Apply main loop.
  */
 static void
-LogicalRepApplyLoop(XLogRecPtr last_received)
+LogicalRepApplyLoop(void)
 {
 	TimestampTz last_recv_timestamp = GetCurrentTimestamp();
 	bool		ping_sent = false;
@@ -3568,7 +3719,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						if (last_received < end_lsn)
 							last_received = end_lsn;
 
-						send_feedback(last_received, reply_requested, false);
+						send_feedback(last_received, reply_requested, false, false);
 						UpdateWorkerStats(last_received, timestamp, true);
 					}
 					/* other message types are purposefully ignored */
@@ -3581,7 +3732,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		}
 
 		/* confirm all writes so far */
-		send_feedback(last_received, false, false);
+		send_feedback(last_received, false, false, false);
 
 		if (!in_remote_transaction && !in_streamed_transaction)
 		{
@@ -3678,7 +3829,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 				}
 			}
 
-			send_feedback(last_received, requestReply, requestReply);
+			send_feedback(last_received, requestReply, requestReply, false);
 
 			/*
 			 * Force reporting to ensure long idle periods don't lead to
@@ -3708,10 +3859,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
  * to send a response to avoid timeouts.
  */
 static void
-send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
+send_feedback(XLogRecPtr recvpos, bool force, bool requestReply, bool has_unprocessed_change)
 {
 	static StringInfo reply_message = NULL;
-	static TimestampTz send_time = 0;
 
 	static XLogRecPtr last_recvpos = InvalidXLogRecPtr;
 	static XLogRecPtr last_writepos = InvalidXLogRecPtr;
@@ -3738,8 +3888,14 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	/*
 	 * No outstanding transactions to flush, we can report the latest received
 	 * position. This is important for synchronous replication.
+	 *
+	 * If the logical replication subscription has unprocessed changes, do
+	 * not inform the publisher that the latest received LSN is already
+	 * applied and flushed; otherwise, the publisher will make a wrong
+	 * assumption about the logical replication progress. Instead, just send a
+	 * feedback message to avoid a replication timeout during the delay.
 	 */
-	if (!have_pending_txes)
+	if (!have_pending_txes && !has_unprocessed_change)
 		flushpos = writepos = recvpos;
 
 	if (writepos < last_writepos)
@@ -3776,8 +3932,9 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
 
-	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
+	elog(DEBUG2, "sending feedback (force %d, has_unprocessed_change %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
+		 has_unprocessed_change,
 		 LSN_FORMAT_ARGS(recvpos),
 		 LSN_FORMAT_ARGS(writepos),
 		 LSN_FORMAT_ARGS(flushpos));
@@ -4367,11 +4524,11 @@ start_table_sync(XLogRecPtr *origin_startpos, char **myslotname)
  * of system resource error and are not repeatable.
  */
 static void
-start_apply(XLogRecPtr origin_startpos)
+start_apply(void)
 {
 	PG_TRY();
 	{
-		LogicalRepApplyLoop(origin_startpos);
+		LogicalRepApplyLoop();
 	}
 	PG_CATCH();
 	{
@@ -4661,7 +4818,8 @@ ApplyWorkerMain(Datum main_arg)
 	}
 
 	/* Run the main loop. */
-	start_apply(origin_startpos);
+	last_received = origin_startpos;
+	start_apply();
 
 	proc_exit(0);
 }
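
As a reading aid for the loop in maybe_apply_delay() above, the choice of how long each WaitLatch() call sleeps can be sketched in isolation. This is not code from the patch: DelaySlice and next_delay_slice() are hypothetical names, and the sketch leaves out the latch, interrupt handling, config reload and the first-iteration adjustment based on send_time.

```c
#include <stdbool.h>

/* Hypothetical result type: how long to sleep, and whether to report after. */
typedef struct DelaySlice
{
	long		wait_ms;
	bool		send_feedback;
} DelaySlice;

/*
 * Given the remaining delay (diffms) and the feedback interval
 * (status_interval_ms), pick the next sleep. If feedback is enabled and the
 * remaining delay is longer than the interval, sleep for one interval and
 * send feedback afterwards; otherwise sleep for the whole remainder.
 */
static DelaySlice
next_delay_slice(long diffms, long status_interval_ms)
{
	DelaySlice	slice;

	if (status_interval_ms > 0 && diffms > status_interval_ms)
	{
		slice.wait_ms = status_interval_ms;
		slice.send_feedback = true;
	}
	else
	{
		slice.wait_ms = diffms;
		slice.send_feedback = false;
	}

	return slice;
}
```

With a 5000 ms remaining delay and a 1000 ms interval this yields a 1000 ms sleep followed by feedback; with the interval set to 0 the worker sleeps for the full 5000 ms and sends nothing during the delay.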
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 6e4599278c..dd06927328 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -512,6 +512,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_VACUUM_TRUNCATE:
 			event_name = "VacuumTruncate";
 			break;
+		case WAIT_EVENT_LOGICAL_APPLY_DELAY:
+			event_name = "LogicalApplyDelay";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 527c7651ab..1e87f0124e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4494,6 +4494,7 @@ getSubscriptions(Archive *fout)
 	int			i_subsynccommit;
 	int			i_subpublications;
 	int			i_subbinary;
+	int			i_subminapplydelay;
 	int			i,
 				ntups;
 
@@ -4546,9 +4547,13 @@ getSubscriptions(Archive *fout)
 						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	if (fout->remoteVersion >= 160000)
-		appendPQExpBufferStr(query, " s.suborigin\n");
+		appendPQExpBufferStr(query,
+							 " s.suborigin,\n"
+							 " s.subminapplydelay\n");
 	else
-		appendPQExpBuffer(query, " '%s' AS suborigin\n", LOGICALREP_ORIGIN_ANY);
+		appendPQExpBuffer(query, " '%s' AS suborigin,\n"
+						  " 0 AS subminapplydelay\n",
+						  LOGICALREP_ORIGIN_ANY);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4576,6 +4581,7 @@ getSubscriptions(Archive *fout)
 	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 	i_subdisableonerr = PQfnumber(res, "subdisableonerr");
 	i_suborigin = PQfnumber(res, "suborigin");
+	i_subminapplydelay = PQfnumber(res, "subminapplydelay");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4606,6 +4612,8 @@ getSubscriptions(Archive *fout)
 		subinfo[i].subdisableonerr =
 			pg_strdup(PQgetvalue(res, i, i_subdisableonerr));
 		subinfo[i].suborigin = pg_strdup(PQgetvalue(res, i, i_suborigin));
+		subinfo[i].subminapplydelay =
+			atoi(PQgetvalue(res, i, i_subminapplydelay));
 
 		/* Decide whether we want to dump it */
 		selectDumpableObject(&(subinfo[i].dobj), fout);
@@ -4687,6 +4695,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
+	if (subinfo->subminapplydelay > 0)
+		appendPQExpBuffer(query, ", min_apply_delay = '%d ms'", subinfo->subminapplydelay);
+
 	appendPQExpBufferStr(query, ");\n");
 
 	if (subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION)
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index e7cbd8d7ed..b8831c3ed3 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -661,6 +661,7 @@ typedef struct _SubscriptionInfo
 	char	   *subdisableonerr;
 	char	   *suborigin;
 	char	   *subsynccommit;
+	int			subminapplydelay;
 	char	   *subpublications;
 } SubscriptionInfo;
 
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index c8a0bb7b3a..81d4607a1c 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6472,7 +6472,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false, false, false, false, false};
+	false, false, false, false, false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6527,10 +6527,13 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Two-phase commit"),
 							  gettext_noop("Disable on error"));
 
+		/* Origin and min_apply_delay are only supported in v16 and higher */
 		if (pset.sversion >= 160000)
 			appendPQExpBuffer(&buf,
-							  ", suborigin AS \"%s\"\n",
-							  gettext_noop("Origin"));
+							  ", suborigin AS \"%s\"\n"
+							  ", subminapplydelay AS \"%s\"\n",
+							  gettext_noop("Origin"),
+							  gettext_noop("Min apply delay"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5e1882eaea..e8b9a43a47 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1925,7 +1925,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("(", "PUBLICATION");
 	/* ALTER SUBSCRIPTION <name> SET ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SET", "("))
-		COMPLETE_WITH("binary", "disable_on_error", "origin", "slot_name",
+		COMPLETE_WITH("binary", "disable_on_error", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit");
 	/* ALTER SUBSCRIPTION <name> SKIP ( */
 	else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && TailMatches("SKIP", "("))
@@ -3268,7 +3268,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
-					  "disable_on_error", "enabled", "origin", "slot_name",
+					  "disable_on_error", "enabled", "min_apply_delay", "origin", "slot_name",
 					  "streaming", "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index b0f2a1705d..d1cfefc6d6 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -74,6 +74,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	Oid			subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */
 
+	int32		subminapplydelay;	/* Replication apply delay (ms) */
+
 	bool		subenabled;		/* True if the subscription is enabled (the
 								 * worker should be running) */
 
@@ -122,6 +124,7 @@ typedef struct Subscription
 								 * skipped */
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
+	int32		minapplydelay;	/* Replication apply delay (ms) */
 	bool		enabled;		/* Indicates if the subscription is enabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index dc87a4edd1..3dc09d1a4c 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -255,7 +255,7 @@ extern void stream_stop_internal(TransactionId xid);
 
 /* Common streaming function to apply all the spooled messages */
 extern void apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-								   XLogRecPtr lsn);
+								   XLogRecPtr lsn, TimestampTz finish_ts);
 
 extern void apply_dispatch(StringInfo s);
 
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6cacd6edaf..f95c5fee8c 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -149,7 +149,8 @@ typedef enum
 	WAIT_EVENT_REGISTER_SYNC_REQUEST,
 	WAIT_EVENT_SPIN_DELAY,
 	WAIT_EVENT_VACUUM_DELAY,
-	WAIT_EVENT_VACUUM_TRUNCATE
+	WAIT_EVENT_VACUUM_TRUNCATE,
+	WAIT_EVENT_LOGICAL_APPLY_DELAY
 } WaitEventTimeout;
 
 /* ----------
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 3f99b14394..cf8e727ee9 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -114,18 +114,18 @@ CREATE SUBSCRIPTION regress_testsub4 CONNECTION 'dbname=regress_doesnotexist' PU
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | none   |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub4 SET (origin = any);
 \dRs+ regress_testsub4
-                                                                                         List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub3;
@@ -143,10 +143,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -163,10 +163,10 @@ ERROR:  unrecognized subscription parameter: "create_slot"
 -- ok
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/12345');
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/12345
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/12345
 (1 row)
 
 -- ok - with lsn = NONE
@@ -175,10 +175,10 @@ ALTER SUBSCRIPTION regress_testsub SKIP (lsn = NONE);
 ALTER SUBSCRIPTION regress_testsub SKIP (lsn = '0/0');
 ERROR:  invalid WAL location (LSN): 0/0
 \dRs+
-                                                                                             List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist2 | 0/0
+                                                                                                      List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 BEGIN;
@@ -210,10 +210,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                                               List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |           Conninfo           | Skip LSN 
----------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+--------------------+------------------------------+----------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    | local              | dbname=regress_doesnotexist2 | 0/0
+                                                                                                        List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |           Conninfo           | Skip LSN 
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+------------------------------+----------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | off       | d                | f                | any    |               0 | local              | dbname=regress_doesnotexist2 | 0/0
 (1 row)
 
 -- rename back to keep the rest simple
@@ -247,19 +247,19 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -271,27 +271,27 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication already exists
@@ -306,10 +306,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                                                 List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 -- fail - publication used more than once
@@ -324,10 +324,10 @@ ERROR:  publication "testpub3" is not in subscription "regress_testsub"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -363,10 +363,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 --fail - alter of two_phase option not supported.
@@ -375,10 +375,10 @@ ERROR:  unrecognized subscription parameter: "two_phase"
 -- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -388,10 +388,10 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | on        | p                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
@@ -404,20 +404,57 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 \dRs+
-                                                                                         List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Synchronous commit |          Conninfo           | Skip LSN 
------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+--------------------+-----------------------------+----------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    | off                | dbname=regress_doesnotexist | 0/0
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | t                | any    |               0 | off                | dbname=regress_doesnotexist | 0/0
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+ERROR:  invalid value for parameter "min_apply_delay": "foo"
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+ERROR:  -1 ms is outside the valid range for parameter "min_apply_delay" (0 .. 2147483647)
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+ERROR:  min_apply_delay > 0 and streaming = parallel are mutually exclusive options
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+WARNING:  subscription was created, but is not connected
+HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |             123 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+                                                                                                  List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Min apply delay | Synchronous commit |          Conninfo           | Skip LSN 
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-----------------+--------------------+-----------------------------+----------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | off       | d                | f                | any    |        86400000 | off                | dbname=regress_doesnotexist | 0/0
+(1 row)
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+ERROR:  cannot set parallel streaming mode for subscription with min_apply_delay
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ERROR:  cannot set min_apply_delay for subscription in parallel streaming mode
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 7281f5fee2..7317b140f5 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -286,6 +286,30 @@ ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail -- min_apply_delay must be a non-negative integer
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = foo);
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = -1);
+
+-- fail - utilizing streaming = parallel with time-delayed replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = parallel, min_apply_delay = 123);
+
+-- success -- min_apply_delay value without unit is taken as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, min_apply_delay = 123);
+\dRs+
+
+-- success -- min_apply_delay value with unit is converted into ms and stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = '1 d');
+\dRs+
+
+-- fail - alter subscription with streaming = parallel should fail when time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
+
+-- fail - alter subscription with min_apply_delay should fail when streaming = parallel is set
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 0, streaming = parallel);
+ALTER SUBSCRIPTION regress_testsub SET (min_apply_delay = 123);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 91aa068c95..75fd77b891 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -515,6 +515,34 @@ $node_publisher->poll_query_until('postgres',
   or die
   "Timed out while waiting for apply to restart after renaming SUBSCRIPTION";
 
+# Test time-delayed logical replication
+#
+# If the subscription sets min_apply_delay parameter, the logical replication
+# worker will delay the transaction apply for min_apply_delay milliseconds. We
+# verify this by looking at the time difference between a) when tuples are
+# inserted on the publisher, and b) when those changes are replicated on the
+# subscriber. Even on slow machines, this strategy will give predictable behavior.
+
+# Set min_apply_delay parameter to 3 seconds
+my $delay = 3;
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+
+# Before doing the insertion, get the current timestamp that will be
+# used as a comparison base.
+my $publisher_insert_time = time();
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_ins VALUES (generate_series(1101, 1120))");
+
+# The publisher waits for the replication to complete
+$node_publisher->wait_for_catchup('tap_sub_renamed');
+
+# This test is successful if and only if the LSN has been applied with at least
+# the configured apply delay.
+ok( time() - $publisher_insert_time >= $delay,
+	"subscriber applies WAL only after replication delay for non-streaming transaction"
+);
+
 # check all the cleanup
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
 
-- 
2.27.0
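
The regression output in the patch above shows that a min_apply_delay value without a unit is taken as milliseconds (123), that '1 d' is stored as 86400000, and that -1 and foo are rejected. The code that parses the option is not part of this section, so the following is only a stand-alone sketch of that behavior. The function name, the result codes and the unit table are assumptions for illustration and are not taken from the patch.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Upper bound quoted in the regression test's error message. */
#define MAX_DELAY_MS 2147483647LL

/* Hypothetical result codes; not taken from the patch. */
typedef enum
{
	DELAY_OK,
	DELAY_INVALID,
	DELAY_OUT_OF_RANGE
} DelayParseResult;

/* Hypothetical unit table; the set of units accepted by the patch may differ. */
static const struct
{
	const char *name;
	long long	ms;
}			delay_units[] = {
	{"", 1},					/* no unit: milliseconds */
	{"ms", 1},
	{"s", 1000},
	{"min", 60000},
	{"h", 3600000},
	{"d", 86400000},
};

/* Convert a string such as "123" or "1 d" into milliseconds. */
static DelayParseResult
parse_delay_ms(const char *s, int *out_ms)
{
	char	   *end;
	long long	val;
	long long	mult = 0;
	size_t		i;

	errno = 0;
	val = strtoll(s, &end, 10);
	if (end == s)
		return DELAY_INVALID;	/* no digits at all, e.g. "foo" */
	if (errno == ERANGE)
		return DELAY_OUT_OF_RANGE;

	/* Allow whitespace between the number and the unit. */
	while (isspace((unsigned char) *end))
		end++;

	for (i = 0; i < sizeof(delay_units) / sizeof(delay_units[0]); i++)
	{
		if (strcmp(end, delay_units[i].name) == 0)
		{
			mult = delay_units[i].ms;
			break;
		}
	}
	if (mult == 0)
		return DELAY_INVALID;	/* unknown unit */

	/* Range check before multiplying, so the product cannot overflow. */
	if (val < 0 || val > MAX_DELAY_MS / mult)
		return DELAY_OUT_OF_RANGE;

	*out_ms = (int) (val * mult);
	return DELAY_OK;
}
```

With this sketch, "123" yields 123 and "1 d" yields 86400000, matching the two "success" cases in the expected output, while "-1" and "foo" fail with a range error and an invalid-value error respectively.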

Attachment: v7-0002-Extend-START_REPLICATION-command-to-accept-walsen.patch (application/octet-stream)
From 3c136f8fb235882c2d26b59f34e5b21c69b8cb6a Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Wed, 8 Feb 2023 09:09:31 +0000
Subject: [PATCH v7 2/2] Extend START_REPLICATION command to accept walsender
 options

This commit extends START_REPLICATION to accept a SHUTDOWN_MODE clause. The
clause is currently implemented only for logical replication.

The shutdown modes are:

1) 'wait_flush' (the default). In this mode, the walsender waits until all sent
WAL has been flushed on the subscriber side before it exits.

2) 'immediate'. In this mode, the walsender exits without confirming the remote
flush. This may break the consistency between publisher and subscriber. This
mode might be useful for a system with a high-latency network (to reduce the
time needed for shutdown), or to allow the publisher to shut down even when the
apply worker is stuck.

Author: Hayato Kuroda
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 doc/src/sgml/protocol.sgml                    | 16 ++++-
 .../libpqwalreceiver/libpqwalreceiver.c       |  7 +++
 src/backend/replication/logical/worker.c      | 18 +++++-
 src/backend/replication/repl_gram.y           | 11 +++-
 src/backend/replication/repl_scanner.l        |  1 +
 src/backend/replication/walsender.c           | 60 ++++++++++++++++++-
 src/include/nodes/replnodes.h                 |  1 +
 src/include/replication/walreceiver.h         |  2 +
 src/test/subscription/t/001_rep_changes.pl    | 22 ++++++-
 src/tools/pgindent/typedefs.list              |  1 +
 10 files changed, 131 insertions(+), 8 deletions(-)
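
The two shutdown modes described in the commit message above can be sketched as a small decision function. This is a simplified illustration and not the code in walsender.c: the enum, the function names and the use of plain 64-bit integers for LSNs are assumptions, and the real walsender checks more state than this before exiting.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names; the patch's own identifiers are not shown here. */
typedef enum
{
	SHUTDOWN_WAIT_FLUSH,		/* default: wait for the remote flush */
	SHUTDOWN_IMMEDIATE			/* exit without confirming the remote flush */
} ShutdownMode;

/*
 * Map the SHUTDOWN_MODE string to a mode. A NULL string stands for an
 * omitted clause and selects the default. Returns false for unknown values.
 */
static bool
parse_shutdown_mode(const char *str, ShutdownMode *mode)
{
	if (str == NULL || strcmp(str, "wait_flush") == 0)
		*mode = SHUTDOWN_WAIT_FLUSH;
	else if (strcmp(str, "immediate") == 0)
		*mode = SHUTDOWN_IMMEDIATE;
	else
		return false;
	return true;
}

/*
 * After a shutdown request: may the walsender exit now?
 * sent_lsn is the end of the WAL already sent, remote_flush_lsn is the
 * position the subscriber has reported as flushed.
 */
static bool
walsender_may_exit(ShutdownMode mode, uint64_t sent_lsn,
				   uint64_t remote_flush_lsn)
{
	if (mode == SHUTDOWN_IMMEDIATE)
		return true;			/* do not wait for the subscriber */
	return remote_flush_lsn >= sent_lsn;	/* wait_flush */
}
```

In 'wait_flush' mode a subscriber that never reports the flush keeps the walsender alive, which is the situation the 'immediate' mode is meant to avoid.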

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 93fc7167d4..1ba5238060 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2500,7 +2500,7 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
     </varlistentry>
 
     <varlistentry id="protocol-replication-start-replication-slot-logical">
-     <term><literal>START_REPLICATION</literal> <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> <literal>LOGICAL</literal> <replaceable class="parameter">XXX/XXX</replaceable> [ ( <replaceable>option_name</replaceable> [ <replaceable>option_value</replaceable> ] [, ...] ) ]</term>
+     <term><literal>START_REPLICATION</literal> <literal>SLOT</literal> <replaceable class="parameter">slot_name</replaceable> <literal>LOGICAL</literal> <replaceable class="parameter">XXX/XXX</replaceable> [ <literal>SHUTDOWN_MODE</literal> { <literal>'wait_flush'</literal> | <literal>'immediate'</literal> } ] [ ( <replaceable>option_name</replaceable> [ <replaceable>option_value</replaceable> ] [, ...] ) ]</term>
      <listitem>
       <para>
        Instructs server to start streaming WAL for logical replication,
@@ -2555,6 +2555,20 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
         </listitem>
        </varlistentry>
 
+       <varlistentry>
+        <term><literal>shutdown_mode</literal></term>
+        <listitem>
+         <para>
+          Determines the behavior of the walsender process at shutdown. If
+          shutdown_mode is <literal>'wait_flush'</literal>, the walsender waits
+          for all the sent WALs to be flushed on the subscriber side. This is
+          the default when SHUTDOWN_MODE is not specified. If shutdown_mode is
+          <literal>'immediate'</literal>, the walsender exits without
+          confirming the remote flush.
+         </para>
+        </listitem>
+       </varlistentry>
+
        <varlistentry>
         <term><replaceable class="parameter">option_name</replaceable></term>
         <listitem>
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 560ec974fa..7aad751a86 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -403,6 +403,12 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		List	   *pubnames;
 		char	   *pubnames_literal;
 
+		/* Add SHUTDOWN_MODE clause if not using the default shutdown_mode */
+		if (options->proto.logical.shutdown_mode_str &&
+			PQserverVersion(conn->streamConn) >= 160000)
+			appendStringInfo(&cmd, " SHUTDOWN_MODE '%s'",
+							 options->proto.logical.shutdown_mode_str);
+
 		appendStringInfoString(&cmd, " (");
 
 		appendStringInfo(&cmd, "proto_version '%u'",
@@ -449,6 +455,7 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, " TIMELINE %u",
 						 options->proto.physical.startpointTLI);
 
+
 	/* Start streaming. */
 	res = libpqrcv_PQexec(conn->streamConn, cmd.data);
 	pfree(cmd.data);
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6b86723b60..be8be077c5 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4043,10 +4043,16 @@ maybe_reread_subscription(void)
 
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
+	 *
 	 * The launcher will start a new worker but note that the parallel apply
 	 * worker won't restart if the streaming option's value is changed from
 	 * 'parallel' to any other value or the server decides not to stream the
 	 * in-progress transaction.
+	 *
+	 * Time-delayed logical replication affects the SHUTDOWN_MODE clause. The
+	 * 'immediate' shutdown mode will be specified if min_apply_delay is
+	 * non-zero, otherwise the default shutdown mode will be used. See
+	 * ApplyWorkerMain.
 	 */
 	if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
 		strcmp(newsub->name, MySubscription->name) != 0 ||
@@ -4055,7 +4061,8 @@ maybe_reread_subscription(void)
 		newsub->stream != MySubscription->stream ||
 		strcmp(newsub->origin, MySubscription->origin) != 0 ||
 		newsub->owner != MySubscription->owner ||
-		!equal(newsub->publications, MySubscription->publications))
+		!equal(newsub->publications, MySubscription->publications) ||
+		(newsub->minapplydelay == 0) != (MySubscription->minapplydelay == 0))
 	{
 		if (am_parallel_apply_worker())
 			ereport(LOG,
@@ -4774,9 +4781,18 @@ ApplyWorkerMain(Datum main_arg)
 
 	options.proto.logical.twophase = false;
 	options.proto.logical.origin = pstrdup(MySubscription->origin);
+	options.proto.logical.shutdown_mode_str = NULL;
 
 	if (!am_tablesync_worker())
 	{
+		/*
+		 * Time-delayed logical replication does not support tablesync
+		 * workers, so only the leader apply worker can request walsenders to
+		 * exit before confirming remote flush.
+		 */
+		if (server_version >= 160000 && MySubscription->minapplydelay > 0)
+			options.proto.logical.shutdown_mode_str = pstrdup("immediate");
+
 		/*
 		 * Even when the two_phase mode is requested by the user, it remains
 		 * as the tri-state PENDING until all tablesyncs have reached READY
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..11a01e6b60 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,6 +76,7 @@ Node *replication_parse_result;
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
+%token K_SHUTDOWN_MODE
 
 %type <node>	command
 %type <node>	base_backup start_replication start_logical_replication
@@ -91,6 +92,7 @@ Node *replication_parse_result;
 %type <boolval>	opt_temporary
 %type <list>	create_slot_options create_slot_legacy_opt_list
 %type <defelt>	create_slot_legacy_opt
+%type <str>		opt_shutdown_mode
 
 %%
 
@@ -276,14 +278,15 @@ start_replication:
 
 /* START_REPLICATION SLOT slot LOGICAL %X/%X options */
 start_logical_replication:
-			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options
+			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR opt_shutdown_mode plugin_options
 				{
 					StartReplicationCmd *cmd;
 					cmd = makeNode(StartReplicationCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $3;
 					cmd->startpoint = $5;
-					cmd->options = $6;
+					cmd->shutdownmode = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -336,6 +339,10 @@ opt_timeline:
 				| /* EMPTY */			{ $$ = 0; }
 			;
 
+opt_shutdown_mode:
+			K_SHUTDOWN_MODE SCONST			{ $$ = $2; }
+			| /* EMPTY */					{ $$ = NULL; }
+		;
 
 plugin_options:
 			'(' plugin_opt_list ')'			{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index cb467ca46f..fcc6f6feda 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
 WAIT				{ return K_WAIT; }
+SHUTDOWN_MODE		{ return K_SHUTDOWN_MODE; }
 
 {space}+		{ /* do nothing */ }
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 75e8363e24..b0f8a2d320 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -219,6 +219,15 @@ typedef struct
 
 static LagTracker *lag_tracker;
 
+/* Shutdown modes */
+typedef enum
+{
+	WALSND_SHUTDOWN_MODE_WAIT_FLUSH,
+	WALSND_SHUTDOWN_MODE_IMMEDIATE
+} WalSndShutdownMode;
+
+static WalSndShutdownMode shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+
 /* Signal handlers */
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
@@ -260,6 +269,7 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
 							  TimeLineID *tli_p);
 
+static void CheckWalSndOptions(const StartReplicationCmd *cmd);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -1272,6 +1282,9 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 		got_STOPPING = true;
 	}
 
+	/* Check given options and set flags accordingly */
+	CheckWalSndOptions(cmd);
+
 	/*
 	 * Create our decoding context, making it start at the previously ack'ed
 	 * position.
@@ -1450,6 +1463,16 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		/*
+		 * In this function, there is a possibility that the walsender is
+		 * stuck. This can happen when the receiving worker is stuck, and then
+		 * the send-buffer of the walsender becomes full. Therefore, we must
+		 * add an additional path for shutdown for immediate shutdown mode.
+		 */
+		if (got_STOPPING &&
+			shutdown_mode == WALSND_SHUTDOWN_MODE_IMMEDIATE)
+			WalSndDone(XLogSendLogical);
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
@@ -3114,19 +3137,25 @@ WalSndDone(WalSndSendDataCallback send_data)
 	 * To figure out whether all WAL has successfully been replicated, check
 	 * flush location if valid, write otherwise. Tools like pg_receivewal will
 	 * usually (unless in synchronous mode) return an invalid flush location.
+	 *
+	 * If we are in the 'immediate' shutdown mode, flush location and output
+	 * buffer is not checked. This may break the consistency between nodes,
+	 * but it may be useful for the system that has high-latency network to
+	 * reduce the amount of time for shutdown.
 	 */
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
-		!pq_is_send_pending())
+	if (WalSndCaughtUp &&
+		(shutdown_mode == WALSND_SHUTDOWN_MODE_IMMEDIATE ||
+		 (sentPtr == replicatedPtr && !pq_is_send_pending())))
 	{
 		QueryCompletion qc;
 
 		/* Inform the standby that XLOG streaming is done */
 		SetQueryCompletion(&qc, CMDTAG_COPY, 0);
 		EndCommand(&qc, DestRemote, false);
-		pq_flush();
+		pq_flush_if_writable();
 
 		proc_exit(0);
 	}
@@ -3849,3 +3878,28 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+/*
+ * Check options for walsender itself and set flags accordingly.
+ *
+ * Currently only one option is accepted.
+ */
+static void
+CheckWalSndOptions(const StartReplicationCmd *cmd)
+{
+	if (cmd->shutdownmode)
+	{
+		char	   *mode = cmd->shutdownmode;
+
+		if (pg_strcasecmp(mode, "wait_flush") == 0)
+			shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+		else if (pg_strcasecmp(mode, "immediate") == 0)
+			shutdown_mode = WALSND_SHUTDOWN_MODE_IMMEDIATE;
+		else
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("invalid value for shutdown mode: \"%s\"", mode),
+					errhint("Available values: wait_flush, immediate."));
+	}
+
+}
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..c96e85e859 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -83,6 +83,7 @@ typedef struct StartReplicationCmd
 	char	   *slotname;
 	TimeLineID	timeline;
 	XLogRecPtr	startpoint;
+	char	   *shutdownmode;
 	List	   *options;
 } StartReplicationCmd;
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index decffe352d..91d9f7b3b1 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -187,6 +187,8 @@ typedef struct
 									 * prepare time */
 			char	   *origin; /* Only publish data originating from the
 								 * specified origin */
+			char	   *shutdown_mode_str;	/* The specified shutdown mode
+											 * string, or NULL. */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 75fd77b891..d649c78de7 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -523,10 +523,18 @@ $node_publisher->poll_query_until('postgres',
 # inserted on the publisher, and b) when those changes are replicated on the
 # subscriber. Even on slow machines, this strategy will give predictable behavior.
 
-# Set min_apply_delay parameter to 3 seconds
+# Check restart on changing min_apply_delay to 3 seconds
 my $delay = 3;
+$oldpid = $node_publisher->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+);
 $node_subscriber->safe_psql('postgres',
 	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+$node_publisher->poll_query_until('postgres',
+	"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+  )
+  or die
+  "Timed out while waiting for the walsender to restart after changing min_apply_delay to non-zero value";
 
 # Before doing the insertion, get the current timestamp that will be
 # used as a comparison base.
@@ -543,6 +551,18 @@ ok( time() - $publisher_insert_time >= $delay,
 	"subscriber applies WAL only after replication delay for non-streaming transaction"
 );
 
+# Check restart on changing min_apply_delay to zero
+$oldpid = $node_publisher->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+);
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '0')");
+$node_publisher->poll_query_until('postgres',
+	"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+  )
+  or die
+  "Timed out while waiting for the walsender to restart after changing min_apply_delay to zero";
+
 # check all the cleanup
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_renamed");
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 36d1dc0117..d06a7868ca 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2976,6 +2976,7 @@ WalReceiverFunctionsType
 WalSnd
 WalSndCtlData
 WalSndSendDataCallback
+WalSndShutdownMode
 WalSndState
 WalTimeSample
 WalUsage
-- 
2.27.0

#54Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#53)
Re: Exit walsender before confirming remote flush in logical replication

On Fri, Feb 10, 2023 at 5:24 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Can't we have this option just as a bool (like shutdown_immediate)?
Why do we want to keep multiple modes?

--
With Regards,
Amit Kapila.

#55Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#54)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit,

Can't we have this option just as a bool (like shutdown_immediate)?
Why do we want to keep multiple modes?

Of course we can use a boolean instead, but the current style is motivated by the post [1].
This allows adding another option in the future, although I do not have one in mind now.

I want to ask other reviewers which one is better...

[1]: /messages/by-id/20230208.112717.1140830361804418505.horikyota.ntt@gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#56Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#55)
Re: Exit walsender before confirming remote flush in logical replication

At Fri, 10 Feb 2023 12:40:43 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in

Dear Amit,

Can't we have this option just as a bool (like shutdown_immediate)?
Why do we want to keep multiple modes?

Of course we can use boolean instead, but current style is motivated by the post[1].
This allows to add another option in future, whereas I do not have idea now.

I want to ask other reviewers which one is better...

[1]: /messages/by-id/20230208.112717.1140830361804418505.horikyota.ntt@gmail.com

IMHO I vaguely don't like that we lose a means to specify the default
behavior here. And I'm not sure we definitely don't need other than
flush and immediate for both physical and logical replication. If it's
not the case, I don't object to make it a Boolean.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#57Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#56)
Re: Exit walsender before confirming remote flush in logical replication

On Mon, Feb 13, 2023 at 7:26 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Fri, 10 Feb 2023 12:40:43 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in

Dear Amit,

Can't we have this option just as a bool (like shutdown_immediate)?
Why do we want to keep multiple modes?

Of course we can use boolean instead, but current style is motivated by the post[1].
This allows to add another option in future, whereas I do not have idea now.

I want to ask other reviewers which one is better...

[1]: /messages/by-id/20230208.112717.1140830361804418505.horikyota.ntt@gmail.com

IMHO I vaguely don't like that we lose a means to specify the default
behavior here. And I'm not sure we definitely don't need other than
flush and immediate for both physical and logical replication.

If we can think of any use case that requires its extension then it
makes sense to make it a non-boolean option but otherwise, let's keep
things simple by having a boolean option.

If it's
not the case, I don't object to make it a Boolean.

Thanks.

--
With Regards,
Amit Kapila.

#58Peter Smith
smithpb2250@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#53)
Re: Exit walsender before confirming remote flush in logical replication

Here are some comments for patch v7-0002.

======
Commit Message

1.
This commit extends START_REPLICATION to accept SHUTDOWN_MODE clause. It is
currently implemented only for logical replication.

~

"to accept SHUTDOWN_MODE clause." --> "to accept a SHUTDOWN_MODE clause."

======
doc/src/sgml/protocol.sgml

2.
START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ SHUTDOWN_MODE {
'wait_flush' | 'immediate' } ] [ ( option_name [ option_value ] [,
...] ) ]

~

IMO this should say shutdown_mode as it did before:
START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ SHUTDOWN_MODE
shutdown_mode ] [ ( option_name [ option_value ] [, ...] ) ]

~~~

3.
+       <varlistentry>
+        <term><literal>shutdown_mode</literal></term>
+        <listitem>
+         <para>
+          Determines the behavior of the walsender process at shutdown. If
+          shutdown_mode is <literal>'wait_flush'</literal>, the walsender waits
+          for all the sent WALs to be flushed on the subscriber side. This is
+          the default when SHUTDOWN_MODE is not specified. If shutdown_mode is
+          <literal>'immediate'</literal>, the walsender exits without
+          confirming the remote flush.
+         </para>
+        </listitem>
+       </varlistentry>

Is the font of the "shutdown_mode" correct? I expected it to be like
the others (e.g. slot_name)

======
src/backend/replication/walsender.c

4.
+static void
+CheckWalSndOptions(const StartReplicationCmd *cmd)
+{
+ if (cmd->shutdownmode)
+ {
+ char    *mode = cmd->shutdownmode;
+
+ if (pg_strcasecmp(mode, "wait_flush") == 0)
+ shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+ else if (pg_strcasecmp(mode, "immediate") == 0)
+ shutdown_mode = WALSND_SHUTDOWN_MODE_IMMEDIATE;
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid value for shutdown mode: \"%s\"", mode),
+ errhint("Available values: wait_flush, immediate."));
+ }
+
+}

Unnecessary extra whitespace at end of the function.

======
src/include/nodes/replnodes.h

5.
@@ -83,6 +83,7 @@ typedef struct StartReplicationCmd
char *slotname;
TimeLineID timeline;
XLogRecPtr startpoint;
+ char *shutdownmode;
List *options;
} StartReplicationCmd;

IMO the last 2 members should have a comment, something like:
/* Only for logical replication */

because that will make it more clear why sometimes they are assigned
and sometimes they are not.

======
src/include/replication/walreceiver.h

6.
Should the protocol version be bumped (and documented) now that the
START_REPLICATION supports a new extended syntax? Or is that done only
for messages sent by pgoutput?

------
Kind Regards,
Peter Smith.
Fujitsu Australia.

#59Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#57)
Re: Exit walsender before confirming remote flush in logical replication

At Mon, 13 Feb 2023 08:27:01 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

On Mon, Feb 13, 2023 at 7:26 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

IMHO I vaguely don't like that we lose a means to specify the default
behavior here. And I'm not sure we definitely don't need other than
flush and immediate for both physical and logical replication.

If we can think of any use case that requires its extension then it
makes sense to make it a non-boolean option but otherwise, let's keep
things simple by having a boolean option.

What do you think about the need for explicitly specifying the
default? I'm fine with specifying the default using a single word,
such as WAIT_FOR_REMOTE_FLUSH.

If it's
not the case, I don't object to make it a Boolean.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#60Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#59)
Re: Exit walsender before confirming remote flush in logical replication

On 2023-02-14 10:05:40 +0900, Kyotaro Horiguchi wrote:

What do you think about the need for explicitly specifying the
default? I'm fine with specifying the default using a single word,
such as WAIT_FOR_REMOTE_FLUSH.

We obviously shouldn't force the option to be present. Why would we want to
break existing clients unnecessarily? Without it the behaviour should be
unchanged from today's.

#61Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#60)
Re: Exit walsender before confirming remote flush in logical replication

At Mon, 13 Feb 2023 17:13:43 -0800, Andres Freund <andres@anarazel.de> wrote in

On 2023-02-14 10:05:40 +0900, Kyotaro Horiguchi wrote:

What do you think about the need for explicitly specifying the
default? I'm fine with specifying the default using a single word,
such as WAIT_FOR_REMOTE_FLUSH.

We obviously shouldn't force the option to be present. Why would we want to
break existing clients unnecessarily? Without it the behaviour should be
unchanged from today's.

I didn't suggest making the option mandatory. I just suggested
providing a way to specify the default value explicitly, like in the
recent commit 746915c686.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
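
The two requirements raised in the messages above (the option stays optional and its absence keeps today's behavior, yet the default can also be named explicitly) can both be met by one parser. The following is a minimal standalone sketch under those assumptions; `ShutdownMode`, `equals_ignore_case`, and `parse_shutdown_mode` are hypothetical names for illustration and are not taken from the patch, which uses `pg_strcasecmp` and `ereport` instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical enum; the patch's own type is WalSndShutdownMode. */
typedef enum
{
	SHUTDOWN_MODE_WAIT_FLUSH,
	SHUTDOWN_MODE_IMMEDIATE
} ShutdownMode;

/* Case-insensitive equality for ASCII strings. */
static bool
equals_ignore_case(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0')
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return false;
		a++;
		b++;
	}
	return *a == *b;			/* both strings must end together */
}

/*
 * Parse an optional shutdown mode. A NULL value means the clause was not
 * given, so the existing behavior (wait_flush) is kept. Returns false for
 * an unrecognized value and leaves *mode untouched in that case.
 */
bool
parse_shutdown_mode(const char *value, ShutdownMode *mode)
{
	assert(mode != NULL);

	if (value == NULL || equals_ignore_case(value, "wait_flush"))
	{
		*mode = SHUTDOWN_MODE_WAIT_FLUSH;
		return true;
	}
	if (equals_ignore_case(value, "immediate"))
	{
		*mode = SHUTDOWN_MODE_IMMEDIATE;
		return true;
	}
	return false;
}
```

Existing clients that never send the clause pass NULL and see no change, while a client that wants to be explicit can send 'wait_flush' and get the same result.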

#62Greg Sabino Mullane
htamfids@gmail.com
In reply to: Kyotaro Horiguchi (#61)
Re: Exit walsender before confirming remote flush in logical replication

Thanks for everyone's work on this, I am very interested in it getting into
a release. What is the status of this?

My use case is Patroni - when it needs to do a failover, it shuts down the
primary. However, large transactions can cause it to stay in the "shutting
down" state for a long time, which means your entire HA system is now
non-functional. I like the idea of a new flag. I'll test this out soon if
the original authors want to make a rebased patch. This thread is old, so
if I don't hear back in a bit, I'll create and test a new one myself. :)

Cheers,
Greg

#63Vitaly Davydov
v.davydov@postgrespro.ru
In reply to: Greg Sabino Mullane (#62)
Re: Exit walsender before confirming remote flush in logical replication

Dear Greg, All

I'm interested in a walsender shutdown mode option as well. Unexpectedly waiting for
walsender shutdown in fast mode makes it hard to meet an SLA. I also
propose creating a GUC that can be used to set the shutdown mode globally.
Since the original author has not responded for some time, I like the idea of
creating a new, separate patch for the walsender shutdown mode based on the work in
this thread. I'm ready to participate in patch preparation and testing.

With best regards,
Vitaly

On Tuesday, September 17, 2024 15:29 MSK, Greg Sabino Mullane <htamfids@gmail.com> wrote:

Thanks for everyone's work on this, I am very interested in it getting into
a release. What is the status of this?

My use case is Patroni - when it needs to do a failover, it shuts down the
primary. However, large transactions can cause it to stay in the "shutting
down" state for a long time, which means your entire HA system is now
non-functional. I like the idea of a new flag. I'll test this out soon if
the original authors want to make a rebased patch. This thread is old, so
if I don't hear back in a bit, I'll create and test a new one myself. :)

Cheers,
Greg

#64Andrey Silitskiy
a.silitskiy@postgrespro.ru
In reply to: Hayato Kuroda (Fujitsu) (#53)
1 attachment(s)
Re: Exit walsender before confirming remote flush in logical replication

Dear pgsql-hackers,

I am also interested in solving this problem, so I suggest a patch which
is based on Hayato's work shared earlier.

The problem we are solving is that logical walsender processes currently
do not allow postgres to shut down until the receiver side confirms the flush of
all data. In the case of logical replication, this is not necessary. This can lead
to an undesirable shutdown delay if, for example, the apply worker is waiting for
locks to be released.

I agree with the opinion that the default behavior of the system should not be
changed, as some clients may rely on the current behavior. But instead of
the START_REPLICATION parameter, I propose a GUC parameter on the sender that
controls the shutdown mode for all logical walsenders. First,
the START_REPLICATION parameter places responsibility for choosing the sender’s
shutdown semantics on the receiver side. Second, per-subscriber settings do not
solve the problematic operational case where many walsenders exist: if even one
of N walsender processes remains configured as non-immediate, the publisher can
still be blocked. In other words, setting immediate for most subscribers but
missing one does not fix the global inability to shut down.
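
To illustrate the second point, here is a minimal standalone sketch of the "one of N" problem, assuming a simplified model in which each walsender has its own mode and a flag saying whether the remote flush has been confirmed. `Sender`, `ShutdownMode`, and `shutdown_is_blocked` are hypothetical names for illustration only and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-walsender model; not PostgreSQL's WalSnd struct. */
typedef enum
{
	SHUTDOWN_MODE_WAIT_FLUSH,
	SHUTDOWN_MODE_IMMEDIATE
} ShutdownMode;

typedef struct
{
	ShutdownMode mode;			/* mode chosen for this walsender */
	bool		flush_confirmed;	/* receiver confirmed all sent data */
} Sender;

/*
 * The publisher cannot finish shutting down while any walsender is still
 * waiting. A single wait_flush sender with an unconfirmed flush is enough
 * to block, regardless of how the others are configured.
 */
bool
shutdown_is_blocked(const Sender *senders, size_t count)
{
	assert(senders != NULL || count == 0);

	for (size_t i = 0; i < count; i++)
	{
		if (senders[i].mode == SHUTDOWN_MODE_WAIT_FLUSH &&
			!senders[i].flush_confirmed)
			return true;
	}
	return false;
}
```

In this model, a sender-side setting that applies to every logical walsender corresponds to all entries having the same mode, so a missed subscriber can no longer hold up the shutdown.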

I also attach a TAP test that reproduces the apply worker waiting for a
lock to be released and the successful shutdown of the publisher in immediate
walsender shutdown mode.

Best Regards,
Andrey

Attachments:

0001-Introduce-a-new-GUC-logical_wal_sender_shutdown_mode.patch (text/x-patch; charset=UTF-8)
From c82757f6aa01758b39814f4168a207174ce95957 Mon Sep 17 00:00:00 2001
From: "a.silitskiy" <a.silitskiy@postgrespro.ru>
Date: Mon, 20 Oct 2025 13:38:52 +0700
Subject: [PATCH] Introduce a new GUC 'logical_wal_sender_shutdown_mode'.

Previously, at shutdown, walsender processes waited to send all pending data
and to ensure that all data was flushed on the remote node. This mechanism was added to
support clean switchover, but that use case cannot be supported for logical
replication.

The new GUC allows changing the shutdown mode of logical walsenders without changing
the default behavior.

The shutdown modes are:

1) 'wait_flush' (the default). In this mode, the walsender will wait for all
WALs to be flushed on the subscriber side, before exiting the process.

2) 'immediate'. In this mode, the walsender will exit without confirming the
remote flush. This may break the consistency between publisher and subscriber.
This mode might be useful for a system that has a high-latency network (to
reduce the amount of time for shutdown), or to allow the shutdown of the
publisher even when the worker is stuck.

Author: Andrey Silitskiy
Co-authored by: Hayato Kuroda
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 doc/src/sgml/config.sgml                      | 27 ++++++++
 src/backend/replication/walsender.c           | 46 +++++++++++++
 src/backend/utils/misc/guc_parameters.dat     |  7 ++
 src/backend/utils/misc/guc_tables.c           |  6 ++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/replication/walsender.h           |  7 ++
 .../t/050_walsnd_immediate_shutdown.pl        | 66 +++++++++++++++++++
 7 files changed, 160 insertions(+)
 create mode 100644 src/test/recovery/t/050_walsnd_immediate_shutdown.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index df1c3eaaa58..c058796b0cf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4702,6 +4702,33 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical_wal_sender_shutdown_mode" xreflabel="logical_wal_sender_shutdown_mode">
+      <term><varname>logical_wal_sender_shutdown_mode</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>logical_wal_sender_shutdown_mode</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the mode in which a logical walsender process terminates
+        after receiving a shutdown request. Valid values are
+        <literal>wait_flush</literal> and <literal>immediate</literal>.
+        The default value is <literal>wait_flush</literal>.
+       </para>
+       <para>
+        In <literal>wait_flush</literal> mode, the walsender will wait for all
+        WALs to be flushed on the subscriber side, before exiting the process.
+       </para>
+       <para>
+        In <literal>immediate</literal> mode, the walsender will exit without waiting
+        for data replication to the subscriber. This may break the consistency between
+        publisher and subscriber. This mode might be useful for a system that has a
+        high-latency network (to reduce the amount of time for shutdown), or to allow
+        the shutdown of the publisher even when the worker is stuck.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-synchronized-standby-slots" xreflabel="synchronized_standby_slots">
       <term><varname>synchronized_standby_slots</varname> (<type>string</type>)
       <indexterm>
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fc8f8559073..5440c1672d2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -130,6 +130,9 @@ int			max_wal_senders = 10;	/* the maximum number of concurrent
 									 * walsenders */
 int			wal_sender_timeout = 60 * 1000; /* maximum time to send one WAL
 											 * data message */
+
+int			logical_wal_sender_shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+
 bool		log_replication_commands = false;
 
 /*
@@ -262,6 +265,7 @@ static void WalSndKill(int code, Datum arg);
 pg_noreturn static void WalSndShutdown(void);
 static void XLogSendPhysical(void);
 static void XLogSendLogical(void);
+pg_noreturn static void LogicalWalSndDoneImmediate(void);
 static void WalSndDone(WalSndSendDataCallback send_data);
 static void IdentifySystem(void);
 static void UploadManifest(void);
@@ -1650,6 +1654,11 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		/* If we got shut down request in immediate shutdown mode, exit the process */
+		if ((got_STOPPING || got_SIGUSR2) &&
+			logical_wal_sender_shutdown_mode == WALSND_SHUTDOWN_MODE_IMMEDIATE)
+			LogicalWalSndDoneImmediate();
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
@@ -2927,6 +2936,15 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
 
+		/*
+		 * When immediate shutdown of logical walsender is requested, we do not
+		 * wait for successful sending of all data.
+		 */
+		if ((got_STOPPING || got_SIGUSR2) &&
+			 send_data == XLogSendLogical &&
+			 logical_wal_sender_shutdown_mode == WALSND_SHUTDOWN_MODE_IMMEDIATE)
+			LogicalWalSndDoneImmediate();
+
 		/* If nothing remains to be sent right now ... */
 		if (WalSndCaughtUp && !pq_is_send_pending())
 		{
@@ -3575,6 +3593,34 @@ XLogSendLogical(void)
 	}
 }
 
+/*
+ * Shutdown logical walsender in immediate mode.
+ *
+ * NB: This should only be called when immediate shutdown of logical walsender
+ * was requested and shutdown signal has been received from postmaster.
+ */
+static void
+LogicalWalSndDoneImmediate(void)
+{
+	QueryCompletion qc;
+
+	/* Try to inform subscriber that XLOG streaming is done */
+	SetQueryCompletion(&qc, CMDTAG_COPY, 0);
+	EndCommand(&qc, DestRemote, false);
+
+	/*
+	 * Note that the output buffer may be full during immediate shutdown of
+	 * logical replication walsender. If pq_flush() is called at that time,
+	 * the walsender process will be stuck. Therefore, call pq_flush_if_writable()
+	 * instead. Receipt of the done message by the subscriber in immediate shutdown mode is
+	 * not guaranteed.
+	 */
+	pq_flush_if_writable();
+
+	proc_exit(0);
+	abort();
+}
+
 /*
  * Shutdown if the sender is caught up.
  *
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 1128167c025..27e8c19a78d 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1833,6 +1833,13 @@
   max => 'MAX_KILOBYTES',
 },
 
+{ name => 'logical_wal_sender_shutdown_mode', type => 'enum', context => 'PGC_SIGHUP', group => 'REPLICATION_SENDING',
+  short_desc => 'Sets the mode in which logical walsender will be terminated after shutdown request.',
+  variable => 'logical_wal_sender_shutdown_mode',
+  boot_val => 'WALSND_SHUTDOWN_MODE_WAIT_FLUSH',
+  options => 'logical_wal_sender_shutdown_mode_options',
+},
+
 { name => 'maintenance_io_concurrency', type => 'int', context => 'PGC_USERSET', group => 'RESOURCES_IO',
   short_desc => 'A variant of "effective_io_concurrency" that is used for maintenance work.',
   long_desc => '0 disables simultaneous requests.',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 0209b2067a2..81f17850aa2 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -335,6 +335,12 @@ static const struct config_enum_entry constraint_exclusion_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry logical_wal_sender_shutdown_mode_options[] = {
+	{"wait_flush", WALSND_SHUTDOWN_MODE_WAIT_FLUSH, false},
+	{"immediate", WALSND_SHUTDOWN_MODE_IMMEDIATE, false},
+	{NULL, 0, false}
+};
+
 /*
  * Although only "on", "off", "remote_apply", "remote_write", and "local" are
  * documented, we accept all the likely variants of "on" and "off".
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6268c175298..66285974f32 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -346,6 +346,7 @@
 #wal_sender_timeout = 60s	# in milliseconds; 0 disables
 #track_commit_timestamp = off	# collect timestamp of transaction commit
 				# (change requires restart)
+#logical_wal_sender_shutdown_mode = wait_flush
 
 # - Primary Server -
 
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index c3e8e191339..01956ebce33 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -24,6 +24,12 @@ typedef enum
 	CRS_USE_SNAPSHOT,
 } CRSSnapshotAction;
 
+typedef enum
+{
+	WALSND_SHUTDOWN_MODE_WAIT_FLUSH = 0,
+	WALSND_SHUTDOWN_MODE_IMMEDIATE
+} LogicalWalSndShutdownMode;
+
 /* global state */
 extern PGDLLIMPORT bool am_walsender;
 extern PGDLLIMPORT bool am_cascading_walsender;
@@ -33,6 +39,7 @@ extern PGDLLIMPORT bool wake_wal_senders;
 /* user-settable parameters */
 extern PGDLLIMPORT int max_wal_senders;
 extern PGDLLIMPORT int wal_sender_timeout;
+extern PGDLLIMPORT int logical_wal_sender_shutdown_mode;
 extern PGDLLIMPORT bool log_replication_commands;
 
 extern void InitWalSender(void);
diff --git a/src/test/recovery/t/050_walsnd_immediate_shutdown.pl b/src/test/recovery/t/050_walsnd_immediate_shutdown.pl
new file mode 100644
index 00000000000..31415d3bed1
--- /dev/null
+++ b/src/test/recovery/t/050_walsnd_immediate_shutdown.pl
@@ -0,0 +1,66 @@
+# Checks that publisher is able to shut down without
+# waiting for sending of all pending data to subscriber
+# with logical_wal_sender_shutdown_mode = immediate
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+my $out;
+
+# create publisher
+my $publisher = PostgreSQL::Test::Cluster->new('publisher');
+$publisher->init(allows_streaming => 'logical');
+# set logical_wal_sender_shutdown_mode GUC parameter to immediate
+$publisher->append_conf('postgresql.conf',
+	'wal_sender_timeout = 0
+	 logical_wal_sender_shutdown_mode = immediate');
+$publisher->start();
+
+# create subscriber
+my $subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$subscriber->init();
+$subscriber->start();
+
+# create publication for test table
+$publisher->safe_psql('postgres', q{
+	CREATE TABLE pub_test (id int PRIMARY KEY);
+	CREATE PUBLICATION pub_all FOR TABLE pub_test;
+});
+
+# create matching table on subscriber
+$subscriber->safe_psql('postgres', q{
+	CREATE TABLE pub_test (id int PRIMARY KEY);
+});
+
+# form connection string to publisher
+my $pub_connstr = $publisher->connstr;
+
+# create the subscription on subscriber
+$subscriber->safe_psql('postgres', qq{
+	CREATE SUBSCRIPTION sub_all
+	CONNECTION '$pub_connstr'
+	PUBLICATION pub_all;
+});
+
+# Wait for initial sync to finish
+$subscriber->wait_for_subscription_sync($publisher, 'sub_all');
+
+# create background psql session
+my $bpgsql = $subscriber->background_psql('postgres', on_error_stop => 0);
+
+# start transaction on subscriber to hold locks
+$bpgsql->query_safe(
+	"BEGIN; INSERT INTO pub_test VALUES (1), (2), (3);"
+);
+
+# run concurrent transaction on publisher and commit
+$out = $publisher->safe_psql('postgres', 'BEGIN; INSERT INTO pub_test VALUES (1), (2), (3); COMMIT;');
+ok($out eq "", "Concurrent transaction was committed on publisher");
+
+# shutdown publisher
+$publisher->stop('fast');
+
+done_testing();
-- 
2.34.1

#65Fujii Masao
masao.fujii@gmail.com
In reply to: Andrey Silitskiy (#64)
Re: Exit walsender before confirming remote flush in logical replication

On Tue, Nov 18, 2025 at 7:32 PM Andrey Silitskiy
<a.silitskiy@postgrespro.ru> wrote:

Dear pgsql-hackers,

I am also interested in solving this problem, so I suggest a patch which
is based on Hayato's work shared earlier.

+1
Thanks for the patch!

+{ name => 'logical_wal_sender_shutdown_mode', type => 'enum', context => 'PGC_SIGHUP', group => 'REPLICATION_SENDING',

How about using PGC_USERSET instead of PGC_SIGHUP, similar to
wal_sender_timeout?
That would allow setting logical_wal_sender_shutdown_mode per walsender
by assigning it to the logical replication user on the publisher and specifying
that user in the CONNECTION clause of the CREATE SUBSCRIPTION command. For example:

# publisher
=# ALTER ROLE testuser SET logical_wal_sender_shutdown_mode TO 'immediate';

# subscriber
=# CREATE SUBSCRIPTION ... CONNECTION '... user=testuser' ...;

Even if the publisher's postgresql.conf sets logical_wal_sender_shutdown_mode to
'wait_flush', the per-role setting would take effect for that connection.
This gives users per-connection control, just like with parameters such as
wal_sender_timeout.

Also, if the patch I proposed in [1] is committed, the same per-connection
control could be done directly via CREATE SUBSCRIPTION:

# subscriber
=# CREATE SUBSCRIPTION ... CONNECTION '... options=''-c logical_wal_sender_shutdown_mode=immediate'''

+        Specifies the mode in which logical walsender process will terminate
+        after receival of shutdown request. Valid values are
+        <literal>wait_flush</literal> and <literal>immediate</literal>.
+        Default value is <literal>wait_flush</literal>.

Shouldn't physical replication walsenders also honor this parameter?
For example, the immediate mode seems useful for physical walsenders connected
from a very remote standby (e.g., DR site). Thoughts?

Regards,

[1]: /messages/by-id/CAHGQGwGYV+-abbKwdrM2UHUe-JYOFWmsrs6=QicyJO-j+-Widw@mail.gmail.com

--
Fujii Masao

#66Andrey Silitskiy
a.silitskiy@postgrespro.ru
In reply to: Fujii Masao (#65)
Re: Exit walsender before confirming remote flush in logical replication

On Wed, Nov 19, 2025 at 8:46 PM Fujii Masao
<masao(dot)fujii(at)gmail(dot)com> wrote:

How about using PGC_USERSET instead of PGC_SIGHUP, similar to
wal_sender_timeout?

Dear Fujii, thanks for the review!

The current version of the patch changes the shutdown mode of logical
walsenders globally for the server. As I wrote above, the patch
deliberately keeps the receiver side from deciding whether the sender is
allowed to hang on shutdown. It also makes the system simpler to
administer. But I'm open to other opinions on this matter.

Shouldn't physical replication walsenders also honor this parameter?
For example, the immediate mode seems useful for physical walsenders connected
from a very remote standby (e.g., DR site). Thought?

As discussed earlier, physical replication is more sensitive to data
divergence, and it has no equivalent of the lock conflict between an
apply worker and a backend, which makes the use case narrower.

By the way, does anyone find the name of IMMEDIATE mode too similar to
the "pg_ctl stop" mode and a little confusing? Initially, I planned
to call this mode WALSND_SHUTDOWN_MODE_FORCED instead of
WALSND_SHUTDOWN_MODE_IMMEDIATE.

Best Regards,
Andrey Silitskiy

#67Fujii Masao
masao.fujii@gmail.com
In reply to: Andrey Silitskiy (#66)
Re: Exit walsender before confirming remote flush in logical replication

On Thu, Nov 20, 2025 at 4:05 PM Andrey Silitskiy
<a.silitskiy@postgrespro.ru> wrote:

On Wed, Nov 19, 2025 at 8:46 PM Fujii Masao
<masao(dot)fujii(at)gmail(dot)com> wrote:

How about using PGC_USERSET instead of PGC_SIGHUP, similar to
wal_sender_timeout?

Dear Fujii, thanks for the review!

Current version of the patch suggests changing the shutdown mode of
logical senders globally for the server. As I wrote above: patch
excludes receiver's side decision whether the sender is allowed to hang
on shutdown. In addition, it provides simpler administration of a system.

Even with PGC_USERSET instead of PGC_SIGHUP, we can still control
the shutdown mode globally by setting it in postgresql.conf. The difference
is that PGC_USERSET also allows per–replication-user overrides when needed,
which gives users more flexibility without losing the ability to
set a server-wide setting, I think.

As discussed earlier, physical replication is more sensitive to data
divergence and there is no problem with apply_worker and backend lock
conflict, which makes the use-case more narrow.

I think there are valid use cases for applying this setting to
physical replication as well. For example, please consider a system
that has generated a large amount of WAL due to bulk loading,
and a remote standby with a slow or low-bandwidth network link.
In such a case, some would think an immediate shutdown could be desirable
rather than waiting a long time for all outstanding WAL to be sent.

Of course, misconfiguring this parameter for physical replication could
lead to serious issues. So if we decide to apply it to physical walsenders,
the docs might need to clearly explain the risks so that users can make
informed decisions, like we've already done for other parameters like fsync,
full_page_writes, etc.

Regards,

--
Fujii Masao

#68Andrey Silitskiy
a.silitskiy@postgrespro.ru
In reply to: Fujii Masao (#67)
1 attachment(s)
Re: Exit walsender before confirming remote flush in logical replication

On Nov 23, 2025 at 11:46 PM Fujii Masao
<masao(dot)fujii(at)gmail(dot)com> wrote:

The difference is that PGC_USERSET also allows per–replication-user
overrides when needed, which gives users more flexibility without
losing the ability to set a server-wide setting, I think.
...
I think there are valid use cases for applying this setting to
physical replication as well.

Thanks for the comments. I agree, this parameter also seems usable
for physical replication, if you use it with caution. In this case,
it really becomes useful to be able to configure a parameter for
each connection. I have added these changes to my patch.

Also, I did not mention earlier another difference between my patch
and the ones discussed before. Previously, even in immediate mode,
the WalSndCaughtUp flag was checked before WalSndDone was called.
As a result, a walsender whose output buffers were full (and which
therefore had WalSndCaughtUp = false) could not shut down even in
immediate mode. The current patch does not have this problem. I added
an additional test case for this situation.

Regards,
Andrey Silitskiy

Attachments:

v2-0001-Introduce-a-new-GUC-wal_sender_shutdown_mode.patch (text/x-patch; charset=UTF-8)
From 6c06cc76cf32d7f47c93b6de9f92aa54eaeead3c Mon Sep 17 00:00:00 2001
From: "a.silitskiy" <a.silitskiy@postgrespro.ru>
Date: Thu, 27 Nov 2025 15:52:34 +0700
Subject: [PATCH v2] Introduce a new GUC 'wal_sender_shutdown_mode'.

Previously, at shutdown, walsender processes always waited to send all
pending data and to confirm that it had been flushed on the remote node. In
some cases such an unbounded wait is unacceptable. For example, in logical
replication an apply worker may be blocked on a lock for some time, which
prevents the walsender from shutting down.

The new GUC allows changing the shutdown mode of walsenders without changing
the default behavior.

The shutdown modes are:

1) 'wait_flush' (the default). In this mode, the walsender will wait for all
WALs to be flushed on the receiver side, before exiting the process.

2) 'immediate'. In this mode, the walsender will exit without confirming the
remote flush. This may break the consistency between sender and receiver.
This mode might be useful for a system that has a high-latency network (to
reduce the amount of time for shutdown), or to allow the shutdown of
publisher even when the subscriber's apply_worker is waiting for any
locks to be released.

Author: Andrey Silitskiy
Co-authored by: Hayato Kuroda
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com
---
 doc/src/sgml/config.sgml                      |  33 +++++
 src/backend/replication/walsender.c           |  44 ++++++
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/guc_tables.c           |   6 +
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/replication/walsender.h           |   7 +
 .../t/037_walsnd_immediate_shutdown.pl        | 136 ++++++++++++++++++
 7 files changed, 235 insertions(+)
 create mode 100644 src/test/subscription/t/037_walsnd_immediate_shutdown.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 737b90736bf..3aa09f90a65 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4708,6 +4708,39 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal_sender_shutdown_mode" xreflabel="wal_sender_shutdown_mode">
+      <term><varname>wal_sender_shutdown_mode</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>wal_sender_shutdown_mode</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies how a walsender process terminates after receiving a
+        shutdown request. Valid values are <literal>wait_flush</literal>
+        and <literal>immediate</literal>. The default is <literal>wait_flush</literal>.
+        This parameter can be set separately for each walsender connection.
+       </para>
+       <para>
+        In <literal>wait_flush</literal> mode, the walsender waits for all
+        WAL to be flushed on the receiver side before exiting. This may
+        unexpectedly delay server shutdown.
+       </para>
+       <para>
+        In <literal>immediate</literal> mode, the walsender will exit without waiting
+        for data replication to the receiver. This may break data consistency between
+        sender and receiver after shutdown, which can be especially important in
+        case of physical replication and switch-over.
+       </para>
+       <para>
+        This mode might be useful for a system that has a high-latency network (to
+        reduce the amount of time for shutdown), or to allow the shutdown of
+        logical replication walsender even when the subscriber's apply_worker
+        is waiting for any locks to be released.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-synchronized-standby-slots" xreflabel="synchronized_standby_slots">
       <term><varname>synchronized_standby_slots</varname> (<type>string</type>)
       <indexterm>
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fc8f8559073..298b167d1e2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -130,6 +130,9 @@ int			max_wal_senders = 10;	/* the maximum number of concurrent
 									 * walsenders */
 int			wal_sender_timeout = 60 * 1000; /* maximum time to send one WAL
 											 * data message */
+
+int			wal_sender_shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+
 bool		log_replication_commands = false;
 
 /*
@@ -262,6 +265,7 @@ static void WalSndKill(int code, Datum arg);
 pg_noreturn static void WalSndShutdown(void);
 static void XLogSendPhysical(void);
 static void XLogSendLogical(void);
+pg_noreturn static void WalSndDoneImmediate(void);
 static void WalSndDone(WalSndSendDataCallback send_data);
 static void IdentifySystem(void);
 static void UploadManifest(void);
@@ -1650,6 +1654,11 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		/* If shutdown was requested in immediate mode, exit the process */
+		if ((got_STOPPING || got_SIGUSR2) &&
+			 wal_sender_shutdown_mode == WALSND_SHUTDOWN_MODE_IMMEDIATE)
+			WalSndDoneImmediate();
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
@@ -2927,6 +2936,14 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
 
+		/*
+		 * When immediate shutdown of walsender is requested, we do not
+		 * wait for successful sending of all data.
+		 */
+		if ((got_STOPPING || got_SIGUSR2) &&
+			 wal_sender_shutdown_mode == WALSND_SHUTDOWN_MODE_IMMEDIATE)
+			WalSndDoneImmediate();
+
 		/* If nothing remains to be sent right now ... */
 		if (WalSndCaughtUp && !pq_is_send_pending())
 		{
@@ -3575,6 +3592,33 @@ XLogSendLogical(void)
 	}
 }
 
+/*
+ * Shutdown walsender in immediate mode.
+ *
+ * NB: This should only be called when immediate shutdown of walsender
+ * was requested and shutdown signal has been received from postmaster.
+ */
+static void
+WalSndDoneImmediate(void)
+{
+	QueryCompletion qc;
+
+	/* Try to inform receiver that XLOG streaming is done */
+	SetQueryCompletion(&qc, CMDTAG_COPY, 0);
+	EndCommand(&qc, DestRemote, false);
+
+	/*
+	 * Note that the output buffer may be full during immediate shutdown of
+	 * walsender. If pq_flush() is called at that time, the walsender process
+	 * will be stuck. Therefore, call pq_flush_if_writable() instead. Successful
+	 * receipt of the done message in immediate shutdown mode is not guaranteed.
+	 */
+	pq_flush_if_writable();
+
+	proc_exit(0);
+	abort();
+}
+
 /*
  * Shutdown if the sender is caught up.
  *
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 3b9d8349078..6911635a85a 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -3422,6 +3422,13 @@
   check_hook => 'check_wal_segment_size',
 },
 
+{ name => 'wal_sender_shutdown_mode', type => 'enum', context => 'PGC_USERSET', group => 'REPLICATION_SENDING',
+  short_desc => 'Sets the mode in which walsender will be terminated after shutdown request.',
+  variable => 'wal_sender_shutdown_mode',
+  boot_val => 'WALSND_SHUTDOWN_MODE_WAIT_FLUSH',
+  options => 'wal_sender_shutdown_mode_options',
+},
+
 { name => 'wal_sender_timeout', type => 'int', context => 'PGC_USERSET', group => 'REPLICATION_SENDING',
   short_desc => 'Sets the maximum time to wait for WAL replication.',
   flags => 'GUC_UNIT_MS',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f87b558c2c6..3306ebb39f9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -335,6 +335,12 @@ static const struct config_enum_entry constraint_exclusion_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry wal_sender_shutdown_mode_options[] = {
+	{"wait_flush", WALSND_SHUTDOWN_MODE_WAIT_FLUSH, false},
+	{"immediate", WALSND_SHUTDOWN_MODE_IMMEDIATE, false},
+	{NULL, 0, false}
+};
+
 /*
  * Although only "on", "off", "remote_apply", "remote_write", and "local" are
  * documented, we accept all the likely variants of "on" and "off".
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dc9e2255f8a..111ae7dea48 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -346,6 +346,8 @@
 #wal_sender_timeout = 60s       # in milliseconds; 0 disables
 #track_commit_timestamp = off   # collect timestamp of transaction commit
                                 # (change requires restart)
+#wal_sender_shutdown_mode = wait_flush	# walsender termination mode after
+                                # receipt of shutdown request
 
 # - Primary Server -
 
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index c3e8e191339..228f3f0aa2c 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -24,6 +24,12 @@ typedef enum
 	CRS_USE_SNAPSHOT,
 } CRSSnapshotAction;
 
+typedef enum
+{
+	WALSND_SHUTDOWN_MODE_WAIT_FLUSH = 0,
+	WALSND_SHUTDOWN_MODE_IMMEDIATE
+} WalSndShutdownMode;
+
 /* global state */
 extern PGDLLIMPORT bool am_walsender;
 extern PGDLLIMPORT bool am_cascading_walsender;
@@ -33,6 +39,7 @@ extern PGDLLIMPORT bool wake_wal_senders;
 /* user-settable parameters */
 extern PGDLLIMPORT int max_wal_senders;
 extern PGDLLIMPORT int wal_sender_timeout;
+extern PGDLLIMPORT int wal_sender_shutdown_mode;
 extern PGDLLIMPORT bool log_replication_commands;
 
 extern void InitWalSender(void);
diff --git a/src/test/subscription/t/037_walsnd_immediate_shutdown.pl b/src/test/subscription/t/037_walsnd_immediate_shutdown.pl
new file mode 100644
index 00000000000..8bbf64df8e0
--- /dev/null
+++ b/src/test/subscription/t/037_walsnd_immediate_shutdown.pl
@@ -0,0 +1,136 @@
+# Checks that publisher is able to shut down without
+# waiting for sending of all pending data to subscriber
+# with wal_sender_shutdown_mode = immediate
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+sub run_with_timeout
+{
+    my ($seconds, $code_to_run) = @_;
+
+    eval
+	{
+        local $SIG{ALRM} = sub { die "timeout\n" };
+        alarm $seconds;
+        $code_to_run->();
+        alarm 0;
+    };
+
+    return $@;   # Empty string means success or contains error text.
+}
+
+sub ok_with_timeout
+{
+    my ($seconds, $code, $msg) = @_;
+    my $err = run_with_timeout($seconds, $code);
+
+    if (!$err)
+	{
+        pass($msg);
+    }
+    elsif ($err =~ /timeout/)
+	{
+        fail("$msg (timed out after $seconds seconds)");
+    }
+    else
+	{
+        fail("$msg (unknown error: $err)");
+    }
+}
+
+my $out;
+
+# create publisher
+my $publisher = PostgreSQL::Test::Cluster->new('publisher');
+$publisher->init(allows_streaming => 'logical');
+# set wal_sender_shutdown_mode GUC parameter to immediate
+$publisher->append_conf('postgresql.conf',
+	'wal_sender_timeout = 0
+	 wal_sender_shutdown_mode = immediate');
+$publisher->start();
+
+# create subscriber
+my $subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$subscriber->init();
+$subscriber->start();
+
+# create publication for test table
+$publisher->safe_psql('postgres', q{
+	CREATE TABLE pub_test (id int PRIMARY KEY);
+	CREATE PUBLICATION pub_all FOR TABLE pub_test;
+});
+
+# create matching table on subscriber
+$subscriber->safe_psql('postgres', q{
+	CREATE TABLE pub_test (id int PRIMARY KEY);
+});
+
+# form connection string to publisher
+my $pub_connstr = $publisher->connstr;
+
+# create the subscription on subscriber
+$subscriber->safe_psql('postgres', qq{
+	CREATE SUBSCRIPTION sub_all
+	CONNECTION '$pub_connstr'
+	PUBLICATION pub_all;
+});
+
+# wait for initial sync to finish
+$subscriber->wait_for_subscription_sync($publisher, 'sub_all');
+
+# create background psql session
+my $bpgsql = $subscriber->background_psql('postgres', on_error_stop => 0);
+
+
+# =============================================================================
+# Testcase start: Shutdown of publisher with empty output buffers
+
+# start transaction on subscriber to hold locks
+$bpgsql->query_safe(
+	"BEGIN; INSERT INTO pub_test VALUES (0);"
+);
+
+# run concurrent transaction on publisher and commit
+$out = $publisher->safe_psql('postgres', 'BEGIN; INSERT INTO pub_test VALUES (0); COMMIT;');
+ok($out eq "", "Concurrent transaction was committed on publisher");
+
+# test publisher shutdown
+ok_with_timeout(5, sub { $publisher->stop('fast') },
+                "Successful fast shutdown of server with empty output buffers");
+
+# Testcase end: Shutdown of publisher with empty output buffers
+# =============================================================================
+
+$bpgsql->query_safe(
+	"ABORT;"
+);
+
+# restart publisher for the next testcase
+$publisher->start();
+
+$subscriber->wait_for_subscription_sync($publisher, 'sub_all');
+
+# =============================================================================
+# Testcase start: Shutdown of publisher with full output buffers
+
+# lock table to make apply_worker hang
+$bpgsql->query_safe(
+	"BEGIN; LOCK TABLE pub_test IN EXCLUSIVE MODE;"
+);
+
+# generate big amount of wal records for locked table
+$out = $publisher->safe_psql('postgres', 'BEGIN; INSERT INTO pub_test SELECT i from generate_series(1, 20000) s(i); COMMIT;');
+ok($out eq "", "Inserts into locked table successfully generated");
+
+# test publisher shutdown
+ok_with_timeout(5, sub { $publisher->stop('fast') },
+                "Successful fast shutdown of server with full output buffers");
+
+# Testcase end: Shutdown of publisher with full output buffers
+# =============================================================================
+
+done_testing();
-- 
2.34.1

#69Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andrey Silitskiy (#68)
Re: Exit walsender before confirming remote flush in logical replication

Hi, Andrey!

On Thu, Nov 27, 2025 at 12:19 PM Andrey Silitskiy
<a.silitskiy@postgrespro.ru> wrote:

On Nov 23, 2025 at 11:46 PM Fujii Masao
<masao(dot)fujii(at)gmail(dot)com> wrote:

The difference is that PGC_USERSET also allows per–replication-user
overrides when needed, which gives users more flexibility without
losing the ability to set a server-wide setting, I think.
...
I think there are valid use cases for applying this setting to
physical replication as well.

Thanks for the comments. I agree, this parameter also seems usable
for physical replication, if you use it with caution. In this case,
it really becomes useful to be able to configure a parameter for
each connection. I have added these changes to my patch.

Also, earlier I did not mention another difference between my patch
and those discussed earlier. Previously, even in immediate mode,
WalSndCaughtUp flag was checked before calling WalSndDone,
and this made it impossible to shut down even in immediate mode
with WalSndCaughtUp = false when the server has full output buffers.
This does not happen in the current patch implementation. I added
an additional test case for this situation.

Thank you for reviving this thread. I think it is reasonable to move
control over the walsender shutdown behavior to the primary server. I
see an analogy with synchronous_commit and synchronous_standby_names.
The primary decides which standbys to wait for and how to wait for them.
Similarly, the primary should decide whom to wait for on shutdown.

I would like to make a couple of suggestions for the patch.
1) I think it's useful to tune particular standbys/subscribers to
specify the walsender shutdown mode. It was possible in the patch by
Hayato Kuroda, and it would be a pity to lose. I suggest implementing
the walsender shutdown mode as a replication slot option.
2) Given that walsender shutdown mode would be a replication slot
option, I propose to rename GUC to default_wal_sender_shutdown_mode.
Also, given we would more likely need to wait for a flush during
streaming replication, I would suggest the following modes: immediate,
wait_for_flush_streaming_only, wait_for_flush. The new intermediate
option would make walsender wait for a flush only for physical
standbys but not for logical subscribers.

What do you think?

------
Regards,
Alexander Korotkov
Supabase

#70Fujii Masao
masao.fujii@gmail.com
In reply to: Alexander Korotkov (#69)
Re: Exit walsender before confirming remote flush in logical replication

On Sat, Jan 3, 2026 at 9:32 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Andrey!

On Thu, Nov 27, 2025 at 12:19 PM Andrey Silitskiy
<a.silitskiy@postgrespro.ru> wrote:

On Nov 23, 2025 at 11:46 PM Fujii Masao
<masao(dot)fujii(at)gmail(dot)com> wrote:

The difference is that PGC_USERSET also allows per–replication-user
overrides when needed, which gives users more flexibility without
losing the ability to set a server-wide setting, I think.
...
I think there are valid use cases for applying this setting to
physical replication as well.

Thanks for the comments. I agree, this parameter also seems usable
for physical replication, if you use it with caution. In this case,
it really becomes useful to be able to configure a parameter for
each connection. I have added these changes to my patch.

Also, earlier I did not mention another difference between my patch
and those discussed earlier. Previously, even in immediate mode,
WalSndCaughtUp flag was checked before calling WalSndDone,
and this made it impossible to shut down even in immediate mode
with WalSndCaughtUp = false when the server has full output buffers.
This does not happen in the current patch implementation. I added
an additional test case for this situation.

Thank you for reviving this thread. I think it is reasonable to move
control over the walsender shutdown behavior to the primary server. I
see an analogy with synchronous_commit and synchronous_standby_names.
Primary decides which standbys wait and which way to wait for them.
Similarly, the primary should decide who to wait on the shutdown.

I would like to make a couple of suggestions for the patch.
1) I think it's useful to tune particular standbys/subscribers to
specify the walsender shutdown mode. It was possible in the patch by
Hayato Kuroda, and it would be a pity to lose. I suggest implementing
the walsender shutdown mode as a replication slot option.

Even with the proposed patch, this can already be done by setting
wal_sender_shutdown_mode in primary_conninfo for physical
replication, or in the CONNECTION clause of CREATE SUBSCRIPTION for
logical replication. For example:

CREATE SUBSCRIPTION ... CONNECTION 'options=''-c wal_sender_shutdown_mode=immediate''' ...

This allows wal_sender_shutdown_mode in postgresql.conf on the
primary or publisher to act as the default, while different values can
be specified per replication connection via primary_conninfo or the
CONNECTION clause. Thoughts?

Regards,

--
Fujii Masao

#71Fujii Masao
masao.fujii@gmail.com
In reply to: Andrey Silitskiy (#68)
Re: Exit walsender before confirming remote flush in logical replication

On Thu, Nov 27, 2025 at 7:19 PM Andrey Silitskiy
<a.silitskiy@postgrespro.ru> wrote:

On Nov 23, 2025 at 11:46 PM Fujii Masao
<masao(dot)fujii(at)gmail(dot)com> wrote:

The difference is that PGC_USERSET also allows per–replication-user
overrides when needed, which gives users more flexibility without
losing the ability to set a server-wide setting, I think.
...
I think there are valid use cases for applying this setting to
physical replication as well.

Thanks for the comments. I agree, this parameter also seems usable
for physical replication, if you use it with caution. In this case,
it really becomes useful to be able to configure a parameter for
each connection. I have added these changes to my patch.

Thanks for updating the patch!

+ /* Try to inform receiver that XLOG streaming is done */
+ SetQueryCompletion(&qc, CMDTAG_COPY, 0);
+ EndCommand(&qc, DestRemote, false);
+
+ /*
+ * Note that the output buffer may be full during immediate shutdown of
+ * walsender. If pq_flush() is called at that time, the walsender process
+ * will be stuck. Therefore, call pq_flush_if_writable() instead. Successfull
+ * receival of done message in immediate shutdown mode is not guaranteed.
+ */
+ pq_flush_if_writable();

Why do we need to send a "done" message to the receiver here?
Since delivery isn't guaranteed in immediate mode, it seems of limited value.

If it isn't necessary, do we need WalSndDoneImmediate() at all,
or could we just reuse WalSndShutdown() for immediate mode?

For the immediate mode, would it make sense to log that the walsender is
terminating in immediate mode and that WAL replication may be incomplete,
so users can more easily understand what happened?

Regards,

--
Fujii Masao