Switching XLog source from archive to streaming when primary available

Started by SATYANARAYANA NARLAPURAM, over 4 years ago · 61 messages · pgsql-hackers
#1 SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com

Hi Hackers,

When the standby can't connect to the primary, it switches the XLog
source from streaming to archive and continues in that state as long
as it can get the WAL from the archive location. On a server with high
WAL activity, getting the WAL from the archive is typically slower
than streaming it from the primary, and the standby can't exit from
that state. This not only increases the lag on the standby but also
adversely impacts the primary, as the WAL accumulates and vacuum is
not able to remove the dead tuples. As a mitigation, DBAs can
remove/advance the slot or remove the restore_command on the standby,
but this is manual work I am trying to avoid. I would like to propose
the following; please let me know your thoughts.

- Automatically attempt to switch the source from archive to streaming
when primary_conninfo is set, after replaying 'N' WAL segments,
governed by the GUC retry_primary_conn_after_wal_segments
- when retry_primary_conn_after_wal_segments is set to -1, the
feature is disabled
- when the retry attempt fails, switch back to the archive
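For illustration, a standby configured per this proposal might look like the sketch below. Note that retry_primary_conn_after_wal_segments is only the GUC proposed above (it does not exist in PostgreSQL), and the host and path values are placeholders:

```
# postgresql.conf on the standby
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
restore_command = 'cp /mnt/wal_archive/%f %p'

# Proposed (hypothetical): retry streaming from the primary after
# replaying N WAL segments from the archive; -1 disables the retry.
retry_primary_conn_after_wal_segments = 16
```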

Thanks,
Satya

#2 Dilip Kumar
dilipbalaut@gmail.com
In reply to: SATYANARAYANA NARLAPURAM (#1)
Re: Switching XLog source from archive to streaming when primary available

On Mon, Nov 29, 2021 at 1:30 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:

Hi Hackers,

When the standby couldn't connect to the primary it switches the XLog source from streaming to archive and continues in that state until it can get the WAL from the archive location. On a server with high WAL activity, typically getting the WAL from the archive is slower than streaming it from the primary and couldn't exit from that state. This not only increases the lag on the standby but also adversely impacts the primary as the WAL gets accumulated, and vacuum is not able to collect the dead tuples. DBAs as a mitigation can however remove/advance the slot or remove the restore_command on the standby but this is a manual work I am trying to avoid. I would like to propose the following, please let me know your thoughts.

Automatically attempt to switch the source from Archive to streaming when the primary_conninfo is set after replaying 'N' wal segment governed by the GUC retry_primary_conn_after_wal_segments
when retry_primary_conn_after_wal_segments is set to -1 then the feature is disabled
When the retry attempt fails, then switch back to the archive

I think there is another thread [1] that is logically trying to solve
a similar issue: basically, if the walreceiver does not exist in the
main recovery apply loop, then it launches the walreceiver. However,
that patch does not change the current XLog source, which I think is
not a good idea because with that the standby will restore from the
archive as well as stream from the primary, so I have given that
review comment on that thread as well. One big difference is that the
patch launches the walreceiver even if the WAL is locally available
and we don't really need more WAL, but that is controlled by a GUC.

[1]: /messages/by-id/CAKYtNApe05WmeRo92gTePEmhOM4myMpCK_+ceSJtC7-AWLw1qw@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#3 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: SATYANARAYANA NARLAPURAM (#1)
Re: Switching XLog source from archive to streaming when primary available

On Mon, Nov 29, 2021 at 1:30 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:

Hi Hackers,

When the standby couldn't connect to the primary it switches the XLog source from streaming to archive and continues in that state until it can get the WAL from the archive location. On a server with high WAL activity, typically getting the WAL from the archive is slower than streaming it from the primary and couldn't exit from that state. This not only increases the lag on the standby but also adversely impacts the primary as the WAL gets accumulated, and vacuum is not able to collect the dead tuples. DBAs as a mitigation can however remove/advance the slot or remove the restore_command on the standby but this is a manual work I am trying to avoid. I would like to propose the following, please let me know your thoughts.

Automatically attempt to switch the source from Archive to streaming when the primary_conninfo is set after replaying 'N' wal segment governed by the GUC retry_primary_conn_after_wal_segments
when retry_primary_conn_after_wal_segments is set to -1 then the feature is disabled
When the retry attempt fails, then switch back to the archive

I've gone through the state machine in WaitForWALToBecomeAvailable and
I understand it this way: a failure to receive WAL records from the
primary causes the current source to switch to archive, and the
standby continues to get WAL records from the archive location; unless
some failure occurs there, the current source is never going to switch
back to stream. Given that getting WAL from the archive location
causes delays in production environments, we miss the chance to take
advantage of a reconnection to the primary after the previous failed
attempt.

So basically, we attempt to switch from archive to streaming (even
though fetching from the archive can succeed) after a certain amount
of time or number of WAL segments. I prefer a timing-based switch to
streaming over one after a number of WAL segments fetched from the
archive. Right now, wal_retrieve_retry_interval is used to wait before
switching to archive after a failed attempt at streaming; IMO, a
similar GUC (that gets set once the source switches from streaming to
archive, and on timeout switches back to streaming) can be used to
switch from archive to streaming after the specified amount of time.

Thoughts?

Regards,
Bharath Rupireddy.

#4 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#3)
Re: Switching XLog source from archive to streaming when primary available

On Sat, Apr 30, 2022 at 6:19 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Mon, Nov 29, 2021 at 1:30 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:

Hi Hackers,

When the standby couldn't connect to the primary it switches the XLog source from streaming to archive and continues in that state until it can get the WAL from the archive location. On a server with high WAL activity, typically getting the WAL from the archive is slower than streaming it from the primary and couldn't exit from that state. This not only increases the lag on the standby but also adversely impacts the primary as the WAL gets accumulated, and vacuum is not able to collect the dead tuples. DBAs as a mitigation can however remove/advance the slot or remove the restore_command on the standby but this is a manual work I am trying to avoid. I would like to propose the following, please let me know your thoughts.

Automatically attempt to switch the source from Archive to streaming when the primary_conninfo is set after replaying 'N' wal segment governed by the GUC retry_primary_conn_after_wal_segments
when retry_primary_conn_after_wal_segments is set to -1 then the feature is disabled
When the retry attempt fails, then switch back to the archive

I've gone through the state machine in WaitForWALToBecomeAvailable and
I understand it this way: failed to receive WAL records from the
primary causes the current source to switch to archive and the standby
continues to get WAL records from archive location unless some failure
occurs there the current source is never going to switch back to
stream. Given the fact that getting WAL from archive location causes
delay in production environments, we miss to take the advantage of the
reconnection to primary after previous failed attempt.

So basically, we try to attempt to switch to streaming from archive
(even though fetching from archive can succeed) after a certain amount
of time or WAL segments. I prefer timing-based switch to streaming
from archive instead of after a number of WAL segments fetched from
archive. Right now, wal_retrieve_retry_interval is being used to wait
before switching to archive after failed attempt from streaming, IMO,
a similar GUC (that gets set once the source switched from streaming
to archive and on timeout it switches to streaming again) can be used
to switch from archive to streaming after the specified amount of
time.

Thoughts?

Here's a v1 patch that I've come up with. Right now I'm using the
existing GUC wal_retrieve_retry_interval to switch to stream mode from
archive mode, as opposed to switching only after a failure to get WAL
in archive mode. If the approach is okay, I can add tests, change the
docs, and add a new GUC to control this behaviour. I'm open to
thoughts and ideas here.

Regards,
Bharath Rupireddy.

Attachments:

v1-0001-Switch-to-stream-mode-from-archive-occasionally.patch (application/octet-stream, +53 -1)
#5 Cary Huang
cary.huang@highgo.ca
In reply to: Bharath Rupireddy (#4)
Re: Switching XLog source from archive to streaming when primary available

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested

Hello

I tested this patch in a setup where the standby is in the middle of replicating and REDOing the primary's WAL files during a very large data insertion. During this time, I kept killing the walreceiver process to cause a streaming failure and force the standby to read from the archive. The system restores from the archive for "wal_retrieve_retry_interval" seconds before it attempts to stream again. Without this patch, once streaming is interrupted, the standby keeps reading from the archive until it reaches the same consistent state as the primary, and only then does it switch back to streaming. So the patch does the job as described and does bring some benefit during a very large REDO job, where it will try to re-stream after restoring some WAL from the archive to speed up the "catch up" process. But if the recovery job is not a large one, PG already switches back to streaming once it hits a consistent state.

thank you

Cary Huang
HighGo Software Canada

#6 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Cary Huang (#5)
Re: Switching XLog source from archive to streaming when primary available

On Sat, Jun 25, 2022 at 1:31 AM Cary Huang <cary.huang@highgo.ca> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested

Hello

I tested this patch in a setup where the standby is in the middle of replicating and REDOing the primary's WAL files during a very large data insertion. During this time, I kept killing the walreceiver process to cause a streaming failure and force the standby to read from the archive. The system restores from the archive for "wal_retrieve_retry_interval" seconds before it attempts to stream again. Without this patch, once streaming is interrupted, the standby keeps reading from the archive until it reaches the same consistent state as the primary, and only then does it switch back to streaming. So the patch does the job as described and does bring some benefit during a very large REDO job, where it will try to re-stream after restoring some WAL from the archive to speed up the "catch up" process. But if the recovery job is not a large one, PG already switches back to streaming once it hits a consistent state.

Thanks a lot Cary for testing the patch.

Here's a v1 patch that I've come up with. I'm right now using the
existing GUC wal_retrieve_retry_interval to switch to stream mode from
archive mode as opposed to switching only after the failure to get WAL
from archive mode. If okay with the approach, I can add tests, change
the docs and add a new GUC to control this behaviour. I'm open to
thoughts and ideas here.

It will be great if I can hear some thoughts on the above points (as
posted upthread).

Regards,
Bharath Rupireddy.

#7 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#6)
Re: Switching XLog source from archive to streaming when primary available

On Fri, Jul 8, 2022 at 9:16 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Sat, Jun 25, 2022 at 1:31 AM Cary Huang <cary.huang@highgo.ca> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested

Hello

I tested this patch in a setup where the standby is in the middle of replicating and REDOing the primary's WAL files during a very large data insertion. During this time, I kept killing the walreceiver process to cause a streaming failure and force the standby to read from the archive. The system restores from the archive for "wal_retrieve_retry_interval" seconds before it attempts to stream again. Without this patch, once streaming is interrupted, the standby keeps reading from the archive until it reaches the same consistent state as the primary, and only then does it switch back to streaming. So the patch does the job as described and does bring some benefit during a very large REDO job, where it will try to re-stream after restoring some WAL from the archive to speed up the "catch up" process. But if the recovery job is not a large one, PG already switches back to streaming once it hits a consistent state.

Thanks a lot Cary for testing the patch.

Here's a v1 patch that I've come up with. I'm right now using the
existing GUC wal_retrieve_retry_interval to switch to stream mode from
archive mode as opposed to switching only after the failure to get WAL
from archive mode. If okay with the approach, I can add tests, change
the docs and add a new GUC to control this behaviour. I'm open to
thoughts and ideas here.

It will be great if I can hear some thoughts on the above points (as
posted upthread).

Here's the v2 patch with a separate GUC; a new GUC was necessary as
the existing GUC wal_retrieve_retry_interval is loaded with multiple
usages. When the feature is enabled, it lets the standby switch to
stream mode, i.e. fetch WAL from the primary, before fetching from the
archive even fails. The switch from archive to stream mode happens in
2 scenarios: 1) when the standby is in initial recovery 2) when there
was a failure in receiving from the primary (walreceiver got killed or
crashed or timed out, or connectivity to the primary was broken, for
whatever reason).

I also added test cases to the v2 patch.

Please review the patch.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachments:

v2-0001-Switch-WAL-source-to-stream-from-archive.patch (application/octet-stream, +265 -5)
#8 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#7)
Re: Switching XLog source from archive to streaming when primary available
+      <indexterm>
+       <primary><varname>wal_source_switch_interval</varname> configuration parameter</primary>
+      </indexterm>

I don't want to bikeshed on the name too much, but I do think we need
something more descriptive. I'm thinking of something like
streaming_replication_attempt_interval or
streaming_replication_retry_interval.

+        Specifies how long the standby server should wait before switching WAL
+        source from WAL archive to primary (streaming replication). This can
+        happen either during the standby initial recovery or after a previous
+        failed attempt to stream WAL from the primary.

I'm not sure what the second sentence means. In general, I think the
explanation in your commit message is much clearer:

The standby makes an attempt to read WAL from primary after
wal_retrieve_retry_interval milliseconds reading from archive.

+        If this value is specified without units, it is taken as milliseconds.
+        The default value is 5 seconds. A setting of <literal>0</literal>
+        disables the feature.

5 seconds seems low. I would expect the default to be 1-5 minutes. I
think it's important to strike a balance between interrupting archive
recovery to attempt streaming replication and letting archive recovery make
progress.

+	 * Try reading WAL from primary after every wal_source_switch_interval
+	 * milliseconds, when state machine is in XLOG_FROM_ARCHIVE state. If
+	 * successful, the state machine moves to XLOG_FROM_STREAM state, otherwise
+	 * it falls back to XLOG_FROM_ARCHIVE state.

It's not clear to me how this is expected to interact with the pg_wal phase
of standby recovery. As the docs note [0], standby servers loop through
archive recovery, recovery from pg_wal, and streaming replication. Does
this cause the pg_wal phase to be skipped (i.e., the standby goes straight
from archive recovery to streaming replication)? I wonder if it'd be
better for this mechanism to simply move the standby to the pg_wal phase so
that the usual ordering isn't changed.

+					if (!first_time &&
+						TimestampDifferenceExceeds(last_switch_time, curr_time,
+												   wal_source_switch_interval))

Shouldn't this also check that wal_source_switch_interval is not set to 0?

+						elog(DEBUG2,
+							 "trying to switch WAL source to %s after fetching WAL from %s for %d milliseconds",
+							 xlogSourceNames[XLOG_FROM_STREAM],
+							 xlogSourceNames[currentSource],
+							 wal_source_switch_interval);
+
+						last_switch_time = curr_time;

Shouldn't the last_switch_time be set when the state machine first enters
XLOG_FROM_ARCHIVE? IIUC this logic is currently counting time spent
elsewhere (e.g., XLOG_FROM_STREAM) when determining whether to force a
source switch. This would mean that a standby that has spent a lot of time
in streaming replication before failing would flip to XLOG_FROM_ARCHIVE,
immediately flip back to XLOG_FROM_STREAM, and then likely flip back to
XLOG_FROM_ARCHIVE when it failed again. Given the standby will wait for
wal_retrieve_retry_interval before going back to XLOG_FROM_ARCHIVE, it
seems like we could end up rapidly looping between sources. Perhaps I am
misunderstanding how this is meant to work.

+	{
+		{"wal_source_switch_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+			gettext_noop("Sets the time to wait before switching WAL source from archive to primary"),
+			gettext_noop("0 turns this feature off."),
+			GUC_UNIT_MS
+		},
+		&wal_source_switch_interval,
+		5000, 0, INT_MAX,
+		NULL, NULL, NULL
+	},

I wonder if the lower bound should be higher to avoid switching
unnecessarily rapidly between WAL sources. I see that
WaitForWALToBecomeAvailable() ensures that standbys do not switch from
XLOG_FROM_STREAM to XLOG_FROM_ARCHIVE more often than once per
wal_retrieve_retry_interval. Perhaps wal_retrieve_retry_interval should be
the lower bound for this GUC, too. Or maybe WaitForWALToBecomeAvailable()
should make sure that the standby makes at least one attempt to restore
the file from archive before switching to streaming replication.

[0]: https://www.postgresql.org/docs/current/warm-standby.html#STANDBY-SERVER-OPERATION

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#9 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#8)
Re: Switching XLog source from archive to streaming when primary available

On Wed, Sep 7, 2022 at 3:27 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

+      <indexterm>
+       <primary><varname>wal_source_switch_interval</varname> configuration parameter</primary>
+      </indexterm>

I don't want to bikeshed on the name too much, but I do think we need
something more descriptive. I'm thinking of something like
streaming_replication_attempt_interval or
streaming_replication_retry_interval.

I could only come up with wal_source_switch_interval after a lot of
bikeshedding myself :). However, streaming_replication_retry_interval
looks much better; I've used it in the latest patch. Thanks.

+        Specifies how long the standby server should wait before switching WAL
+        source from WAL archive to primary (streaming replication). This can
+        happen either during the standby initial recovery or after a previous
+        failed attempt to stream WAL from the primary.

I'm not sure what the second sentence means. In general, I think the
explanation in your commit message is much clearer:

I polished the comments, docs and commit message a bit, please check now.

5 seconds seems low. I would expect the default to be 1-5 minutes. I
think it's important to strike a balance between interrupting archive
recovery to attempt streaming replication and letting archive recovery make
progress.

+1 for a default value of 5 minutes to avoid frequent interruptions
for archive mode when primary is really down for a long time. I've
also added a cautionary note in the docs about the lower values.

+        * Try reading WAL from primary after every wal_source_switch_interval
+        * milliseconds, when state machine is in XLOG_FROM_ARCHIVE state. If
+        * successful, the state machine moves to XLOG_FROM_STREAM state, otherwise
+        * it falls back to XLOG_FROM_ARCHIVE state.

It's not clear to me how this is expected to interact with the pg_wal phase
of standby recovery. As the docs note [0], standby servers loop through
archive recovery, recovery from pg_wal, and streaming replication. Does
this cause the pg_wal phase to be skipped (i.e., the standby goes straight
from archive recovery to streaming replication)? I wonder if it'd be
better for this mechanism to simply move the standby to the pg_wal phase so
that the usual ordering isn't changed.

[0] https://www.postgresql.org/docs/current/warm-standby.html#STANDBY-SERVER-OPERATION

It doesn't change any behaviour as such for XLOG_FROM_PG_WAL. In
standby mode when restore_command is specified, the initial value of
currentSource would be XLOG_FROM_ARCHIVE (see [1]). If the archive is
exhausted of WAL or the standby fails to fetch from the archive, then
it switches to XLOG_FROM_STREAM. If the standby fails to receive WAL
from primary, it switches back to XLOG_FROM_ARCHIVE. This continues
unless the standby gets promoted. With the patch, we enable the
standby to try fetching from the primary, instead of waiting for WAL
in the archive to get exhausted or for an error to occur in the
standby while receiving from the archive.

+                                       if (!first_time &&
+                                               TimestampDifferenceExceeds(last_switch_time, curr_time,
+                                                                                                  wal_source_switch_interval))

Shouldn't this also check that wal_source_switch_interval is not set to 0?

Corrected.

+                                               elog(DEBUG2,
+                                                        "trying to switch WAL source to %s after fetching WAL from %s for %d milliseconds",
+                                                        xlogSourceNames[XLOG_FROM_STREAM],
+                                                        xlogSourceNames[currentSource],
+                                                        wal_source_switch_interval);
+
+                                               last_switch_time = curr_time;

Shouldn't the last_switch_time be set when the state machine first enters
XLOG_FROM_ARCHIVE? IIUC this logic is currently counting time spent
elsewhere (e.g., XLOG_FROM_STREAM) when determining whether to force a
source switch. This would mean that a standby that has spent a lot of time
in streaming replication before failing would flip to XLOG_FROM_ARCHIVE,
immediately flip back to XLOG_FROM_STREAM, and then likely flip back to
XLOG_FROM_ARCHIVE when it failed again. Given the standby will wait for
wal_retrieve_retry_interval before going back to XLOG_FROM_ARCHIVE, it
seems like we could end up rapidly looping between sources. Perhaps I am
misunderstanding how this is meant to work.

last_switch_time indicates the time when the standby last attempted to
switch to the primary. For instance, a standby:
1) for the first time, sets last_switch_time = current_time when in
archive mode
2) if current_time < last_switch_time + interval, continues in archive
mode
3) if current_time >= last_switch_time + interval, attempts to switch
to the primary and sets last_switch_time = current_time
3.1) if it successfully switches to the primary, it continues there;
if it then fails to fetch from the primary for any reason, it enters
archive mode and loops from step (2)
3.2) if it fails to switch to the primary, it enters archive mode and
loops from step (2)

Hope this clarifies the behaviour.

+       {
+               {"wal_source_switch_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+                       gettext_noop("Sets the time to wait before switching WAL source from archive to primary"),
+                       gettext_noop("0 turns this feature off."),
+                       GUC_UNIT_MS
+               },
+               &wal_source_switch_interval,
+               5000, 0, INT_MAX,
+               NULL, NULL, NULL
+       },

I wonder if the lower bound should be higher to avoid switching
unnecessarily rapidly between WAL sources. I see that
WaitForWALToBecomeAvailable() ensures that standbys do not switch from
XLOG_FROM_STREAM to XLOG_FROM_ARCHIVE more often than once per
wal_retrieve_retry_interval. Perhaps wal_retrieve_retry_interval should be
the lower bound for this GUC, too. Or maybe WaitForWALToBecomeAvailable()
should make sure that the standby makes at least once attempt to restore
the file from archive before switching to streaming replication.

No, we need a way to disable the feature, so I'm not changing the
lower bound. And let's not make this GUC dependent on any other GUC; I
would like to keep it simple for better usability. However, I've
increased the default value to 5min and added a note in the docs about
lower values.

I'm attaching the v3 patch with the review comments addressed, please
review it further.

[1]:
if (!InArchiveRecovery)
    currentSource = XLOG_FROM_PG_WAL;
else if (currentSource == XLOG_FROM_ANY ||
         (!StandbyMode && currentSource == XLOG_FROM_STREAM))
{
    lastSourceFailed = false;
    currentSource = XLOG_FROM_ARCHIVE;
}

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-Allow-standby-to-switch-WAL-source-from-archive-t.patch (application/octet-stream, +270 -5)
#10 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#9)
Re: Switching XLog source from archive to streaming when primary available

On Thu, Sep 08, 2022 at 05:16:53PM +0530, Bharath Rupireddy wrote:

On Wed, Sep 7, 2022 at 3:27 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

It's not clear to me how this is expected to interact with the pg_wal phase
of standby recovery. As the docs note [0], standby servers loop through
archive recovery, recovery from pg_wal, and streaming replication. Does
this cause the pg_wal phase to be skipped (i.e., the standby goes straight
from archive recovery to streaming replication)? I wonder if it'd be
better for this mechanism to simply move the standby to the pg_wal phase so
that the usual ordering isn't changed.

It doesn't change any behaviour as such for XLOG_FROM_PG_WAL. In
standby mode when recovery_command is specified, the initial value of
currentSource would be XLOG_FROM_ARCHIVE (see [1]). If the archive is
exhausted of WAL or the standby fails to fetch from the archive, then
it switches to XLOG_FROM_STREAM. If the standby fails to receive WAL
from primary, it switches back to XLOG_FROM_ARCHIVE. This continues
unless the standby gets promoted. With the patch, we enable the
standby to try fetching from the primary, instead of waiting for WAL
in the archive to get exhausted or for an error to occur in the
standby while receiving from the archive.

Okay. I see that you are checking for XLOG_FROM_ARCHIVE.

Shouldn't the last_switch_time be set when the state machine first enters
XLOG_FROM_ARCHIVE? IIUC this logic is currently counting time spent
elsewhere (e.g., XLOG_FROM_STREAM) when determining whether to force a
source switch. This would mean that a standby that has spent a lot of time
in streaming replication before failing would flip to XLOG_FROM_ARCHIVE,
immediately flip back to XLOG_FROM_STREAM, and then likely flip back to
XLOG_FROM_ARCHIVE when it failed again. Given the standby will wait for
wal_retrieve_retry_interval before going back to XLOG_FROM_ARCHIVE, it
seems like we could end up rapidly looping between sources. Perhaps I am
misunderstanding how this is meant to work.

last_switch_time indicates the time when the standby last attempted to
switch to primary. For instance, a standby:
1) for the first time, sets last_switch_time = current_time when in archive mode
2) if current_time < last_switch_time + interval, continues to be in
archive mode
3) if current_time >= last_switch_time + interval, attempts to switch
to primary and sets last_switch_time = current_time
3.1) if successfully switches to primary, continues in there and for
any reason fails to fetch from primary, then enters archive mode and
loops from step (2)
3.2) if fails to switch to primary, then enters archive mode and loops
from step (2)

Let's say I have this new parameter set to 5 minutes, and I have a standby
that's been at step 3.1 for 5 days before failing and going back to step 2.
Won't the standby immediately jump back to step 3.1? I think we should
place the limit on how long the server stays in XLOG_FROM_ARCHIVE, not how
long it's been since we last tried XLOG_FROM_STREAM.

I wonder if the lower bound should be higher to avoid switching
unnecessarily rapidly between WAL sources. I see that
WaitForWALToBecomeAvailable() ensures that standbys do not switch from
XLOG_FROM_STREAM to XLOG_FROM_ARCHIVE more often than once per
wal_retrieve_retry_interval. Perhaps wal_retrieve_retry_interval should be
the lower bound for this GUC, too. Or maybe WaitForWALToBecomeAvailable()
should make sure that the standby makes at least one attempt to restore
the file from archive before switching to streaming replication.

No, we need a way to disable the feature, so I'm not changing the
lower bound. And let's not make this GUC dependent on any other GUC, I
would like to keep it simple for better usability. However, I've
increased the default value to 5min and added a note in the docs about
the lower values.

I'm attaching the v3 patch with the review comments addressed, please
review it further.

My general point is that we should probably offer some basic preventative
measure against flipping back and forth between streaming and archive
recovery while making zero progress. As I noted, maybe that's as simple as
having WaitForWALToBecomeAvailable() attempt to restore a file from archive
at least once before the new parameter forces us to switch to streaming
replication. There might be other ways to handle this.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#11Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Bharath Rupireddy (#9)
Re: Switching XLog source from archive to streaming when primary available

Being late for the party.

It seems to me that the function is getting too long. I think we
might want to move the core part of the patch into another function.

I think it might be better if intentionalSourceSwitch doesn't need
lastSourceFailed set. It would look like this:

if (lastSourceFailed || switchSource)
{
	if (nonblocking && lastSourceFailed)
		return XLREAD_WOULDBLOCK;

+					if (first_time)
+						last_switch_time = curr_time;
..
+					if (!first_time &&
+						TimestampDifferenceExceeds(last_switch_time, curr_time,
..
+					/* We're not here for the first time any more */
+					if (first_time)
+						first_time = false;

I don't think the flag first_time is needed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Nathan Bossart (#10)
Re: Switching XLog source from archive to streaming when primary available

At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

On Thu, Sep 08, 2022 at 05:16:53PM +0530, Bharath Rupireddy wrote:

I'm attaching the v3 patch with the review comments addressed, please
review it further.

My general point is that we should probably offer some basic preventative
measure against flipping back and forth between streaming and archive
recovery while making zero progress. As I noted, maybe that's as simple as
having WaitForWALToBecomeAvailable() attempt to restore a file from archive
at least once before the new parameter forces us to switch to streaming
replication. There might be other ways to handle this.

+1.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Kyotaro Horiguchi (#12)
Re: Switching XLog source from archive to streaming when primary available

On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

On Thu, Sep 08, 2022 at 05:16:53PM +0530, Bharath Rupireddy wrote:

I'm attaching the v3 patch with the review comments addressed, please
review it further.

My general point is that we should probably offer some basic preventative
measure against flipping back and forth between streaming and archive
recovery while making zero progress. As I noted, maybe that's as simple as
having WaitForWALToBecomeAvailable() attempt to restore a file from archive
at least once before the new parameter forces us to switch to streaming
replication. There might be other ways to handle this.

+1.

Hm. In that case, I think we can get rid of timeout based switching
mechanism and have this behaviour - the standby can attempt to switch
to streaming mode from archive, say, after fetching 1, 2 or a
configurable number of WAL files. In fact, this is the original idea
proposed by Satya in this thread.

If okay, I can code on that. Thoughts?
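
For comparison, the segment-count variant floated here could look roughly
like this (hypothetical names; the -1-disables convention follows Satya's
original retry_primary_conn_after_wal_segments proposal):

```c
#include <stdbool.h>

/* Hypothetical counter of WAL files restored from archive so far. */
static int wal_files_fetched_from_archive;

/* Called after each WAL file is successfully restored from archive. */
static void
note_wal_file_restored(void)
{
	wal_files_fetched_from_archive++;
}

/*
 * After fetching N files from archive, attempt to reconnect to the
 * primary; a setting of -1 disables the feature, per the original
 * proposal.
 */
static bool
should_retry_primary(int retry_after_segments)
{
	if (retry_after_segments < 0)
		return false;
	return wal_files_fetched_from_archive >= retry_after_segments;
}
```

As noted downthread, the weakness of this unit is that a "WAL file" can
represent very different amounts of time and data across clusters.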

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#14Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#13)
Re: Switching XLog source from archive to streaming when primary available

On Fri, Sep 09, 2022 at 12:14:25PM +0530, Bharath Rupireddy wrote:

On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

My general point is that we should probably offer some basic preventative
measure against flipping back and forth between streaming and archive
recovery while making zero progress. As I noted, maybe that's as simple as
having WaitForWALToBecomeAvailable() attempt to restore a file from archive
at least once before the new parameter forces us to switch to streaming
replication. There might be other ways to handle this.

+1.

Hm. In that case, I think we can get rid of timeout based switching
mechanism and have this behaviour - the standby can attempt to switch
to streaming mode from archive, say, after fetching 1, 2 or a
configurable number of WAL files. In fact, this is the original idea
proposed by Satya in this thread.

IMO the timeout approach would be more intuitive for users. When it comes
to archive recovery, "WAL segment" isn't a standard unit of measure. WAL
segment size can differ between clusters, and WAL files can have different
amounts of data or take different amounts of time to replay. So I think it
would be difficult for the end user to decide on a value. However, even
the timeout approach has this sort of problem. If your parameter is set to
1 minute, but the current archive takes 5 minutes to recover, you won't
really be testing streaming replication once a minute. That would likely
need to be documented.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#15Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#14)
Re: Switching XLog source from archive to streaming when primary available

On Fri, Sep 9, 2022 at 10:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Fri, Sep 09, 2022 at 12:14:25PM +0530, Bharath Rupireddy wrote:

On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in

My general point is that we should probably offer some basic preventative
measure against flipping back and forth between streaming and archive
recovery while making zero progress. As I noted, maybe that's as simple as
having WaitForWALToBecomeAvailable() attempt to restore a file from archive
at least once before the new parameter forces us to switch to streaming
replication. There might be other ways to handle this.

+1.

Hm. In that case, I think we can get rid of timeout based switching
mechanism and have this behaviour - the standby can attempt to switch
to streaming mode from archive, say, after fetching 1, 2 or a
configurable number of WAL files. In fact, this is the original idea
proposed by Satya in this thread.

IMO the timeout approach would be more intuitive for users. When it comes
to archive recovery, "WAL segment" isn't a standard unit of measure. WAL
segment size can differ between clusters, and WAL files can have different
amounts of data or take different amounts of time to replay.

How about the amount of WAL bytes fetched from the archive after which
a standby attempts to connect to primary or enter streaming mode? Of
late, we've changed some GUCs to represent bytes instead of WAL
files/segments, see [1].

So I think it
would be difficult for the end user to decide on a value. However, even
the timeout approach has this sort of problem. If your parameter is set to
1 minute, but the current archive takes 5 minutes to recover, you won't
really be testing streaming replication once a minute. That would likely
need to be documented.

If we have configurable WAL bytes instead of timeout for standby WAL
source switch from archive to primary, we don't have the above problem
right?

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c3fe108c025e4a080315562d4c15ecbe3f00405e

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#16Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#15)
Re: Switching XLog source from archive to streaming when primary available

On Fri, Sep 09, 2022 at 11:07:00PM +0530, Bharath Rupireddy wrote:

On Fri, Sep 9, 2022 at 10:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

IMO the timeout approach would be more intuitive for users. When it comes
to archive recovery, "WAL segment" isn't a standard unit of measure. WAL
segment size can differ between clusters, and WAL files can have different
amounts of data or take different amounts of time to replay.

How about the amount of WAL bytes fetched from the archive after which
a standby attempts to connect to primary or enter streaming mode? Of
late, we've changed some GUCs to represent bytes instead of WAL
files/segments, see [1].

Well, for wal_keep_size, using bytes makes sense. Given you know how much
disk space you have, you can set this parameter accordingly to avoid
retaining too much of it for standby servers. For your proposed parameter,
it's not so simple. The same setting could have wildly different timing
behavior depending on the server. I still think that a timeout is the most
intuitive.

So I think it
would be difficult for the end user to decide on a value. However, even
the timeout approach has this sort of problem. If your parameter is set to
1 minute, but the current archive takes 5 minutes to recover, you won't
really be testing streaming replication once a minute. That would likely
need to be documented.

If we have configurable WAL bytes instead of timeout for standby WAL
source switch from archive to primary, we don't have the above problem
right?

If you are going to stop replaying in the middle of a WAL archive, then
maybe. But I don't think I'd recommend that.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#17Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#16)
Re: Switching XLog source from archive to streaming when primary available

On Sat, Sep 10, 2022 at 3:35 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

Well, for wal_keep_size, using bytes makes sense. Given you know how much
disk space you have, you can set this parameter accordingly to avoid
retaining too much of it for standby servers. For your proposed parameter,
it's not so simple. The same setting could have wildly different timing
behavior depending on the server. I still think that a timeout is the most
intuitive.

Hm. In the v3 patch, I've used the timeout approach, but it tracks the
duration the server spends in XLOG_FROM_ARCHIVE, as opposed to tracking
the time of the last streaming failure from the primary.

So I think it
would be difficult for the end user to decide on a value. However, even
the timeout approach has this sort of problem. If your parameter is set to
1 minute, but the current archive takes 5 minutes to recover, you won't
really be testing streaming replication once a minute. That would likely
need to be documented.

Added a note in the docs.

On Fri, Sep 9, 2022 at 10:46 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Being late for the party.

Thanks for reviewing this.

It seems to me that the function is getting too long. I think we
might want to move the core part of the patch into another function.

Yeah, WaitForWALToBecomeAvailable() (without this patch) is around
460 LOC, of which fetching WAL from the chosen source accounts for about
240 LOC; IMO, that code is a candidate for a new function. I think that
part can be discussed separately.

Having said that, I moved the new code to a new function.

I think it might be better if intentionalSourceSwitch doesn't need
lastSourceFailed set. It would look like this:

if (lastSourceFailed || switchSource)
{
	if (nonblocking && lastSourceFailed)
		return XLREAD_WOULDBLOCK;

I think the above looks good, done that way in the latest patch.

I don't think the flag first_time is needed.

Addressed this in the v4 patch.

Please review the attached v4 patch addressing above review comments.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-Allow-standby-to-switch-WAL-source-from-archive-t.patch (+301 -12)
#18Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#17)
Re: Switching XLog source from archive to streaming when primary available

On Mon, Sep 12, 2022 at 9:03 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Please review the attached v4 patch addressing above review comments.

Oops, there's a compiler warning [1] with the v4 patch, fixed it.
Please review the attached v5 patch.

[1]: https://cirrus-ci.com/task/5730076611313664?logs=gcc_warning#L450

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v5-0001-Allow-standby-to-switch-WAL-source-from-archive-t.patch (+301 -12)
#19Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#18)
Re: Switching XLog source from archive to streaming when primary available

On Mon, Sep 12, 2022 at 11:56 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Please review the attached v5 patch.

I'm attaching the v6 patch that's rebased on to the latest HEAD.
Please consider this for review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v6-0001-Allow-standby-to-switch-WAL-source-from-archive-t.patch (+301 -12)
#20Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Bharath Rupireddy (#19)
Re: Switching XLog source from archive to streaming when primary available

At Thu, 15 Sep 2022 10:28:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

I'm attaching the v6 patch that's rebased on to the latest HEAD.
Please consider this for review.

Thanks for the new version!

+#define StreamingReplRetryEnabled() \
+	(streaming_replication_retry_interval > 0 && \
+	 StandbyMode && \
+	 currentSource == XLOG_FROM_ARCHIVE)

It seems to me a bit too complex..

+			/* Save the timestamp at which we're switching to archive. */
+			if (StreamingReplRetryEnabled())
+				switched_to_archive_at = GetCurrentTimestamp();

Anyway, we are going to open a file just after this, so
GetCurrentTimestamp() cannot cause a perceptible degradation.
Couldn't we do that unconditionally, to get rid of the macro?
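
The suggested simplification could look roughly like this (a standalone
sketch using the quoted names; the local GetCurrentTimestamp() is merely a
stand-in for PostgreSQL's real function):

```c
#include <time.h>

typedef long TimestampTz;           /* stand-in for PostgreSQL's type */

static TimestampTz switched_to_archive_at;

/* Stand-in for PostgreSQL's GetCurrentTimestamp(). */
static TimestampTz
GetCurrentTimestamp(void)
{
	return (TimestampTz) time(NULL);
}

/*
 * Record the switch time unconditionally.  Since a file open follows
 * immediately, the extra GetCurrentTimestamp() call is negligible, and
 * the StreamingReplRetryEnabled() guard around the assignment becomes
 * unnecessary; the GUC and standby-mode checks can stay at the single
 * place where the retry decision is actually made.
 */
static void
note_switched_to_archive(void)
{
	switched_to_archive_at = GetCurrentTimestamp();
}
```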

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#21Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Kyotaro Horiguchi (#20)
#22Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Bharath Rupireddy (#21)
#23Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Kyotaro Horiguchi (#22)
#24Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#23)
#25Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#24)
#26Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#25)
#27Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#26)
#28Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#27)
#29Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#28)
#30Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#29)
#31Ian Lawrence Barwick
barwick@gmail.com
In reply to: Bharath Rupireddy (#30)
#32Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Ian Lawrence Barwick (#31)
#33Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#30)
#34Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#33)
#35Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#34)
#36Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#35)
#37Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#32)
#38Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#37)
#39Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#38)
#40Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#39)
#41Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#40)
#42Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#41)
#43Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#42)
#44Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#43)
#45Japin Li
japinli@hotmail.com
In reply to: Bharath Rupireddy (#44)
#46Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Japin Li (#45)
#47Japin Li
japinli@hotmail.com
In reply to: Bharath Rupireddy (#46)
#48Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Japin Li (#47)
#49Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#48)
#50Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#49)
#51Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#50)
#52Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#51)
#53Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#52)
#54Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#53)
#55Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#54)
#56Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#55)
#57John H
johnhyvr@gmail.com
In reply to: Bharath Rupireddy (#56)
#58Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: John H (#57)
#59John H
johnhyvr@gmail.com
In reply to: Bharath Rupireddy (#58)
#60Andrey Borodin
amborodin@acm.org
In reply to: Bharath Rupireddy (#56)
#61Michael Paquier
michael@paquier.xyz
In reply to: Andrey Borodin (#60)