An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
Hi,
With synchronous replication, a transaction (txn) typically gets
committed locally first, is then streamed to the sync standbys, and the
backend that generated the txn waits for an ack from the sync standbys.
While waiting for the ack, it may happen that the query or the txn gets
canceled (QueryCancelPending is true) or the waiting backend is asked to
exit (ProcDiePending is true). In either of these cases, the wait for
the ack gets canceled and leaves the txn in an inconsistent state (as
in, the client may think that the txn has been replicated to the sync
standbys) - "The transaction has already committed locally, but might
not have been replicated to the standby.". Upon restart after a crash,
or in the next txn after the old locally committed txn was canceled, the
client will be able to see txns that weren't actually streamed to the
sync standbys. Also, if the client fails over to one of the sync
standbys after the crash (either by choice or because of automatic
failover management), the locally committed txns on the crashed primary
would be lost, which isn't acceptable in a true HA solution.
Here's a proposal (mentioned previously by Satya [1]) to avoid the
above problems:
1) Wait a configurable amount of time before canceling the sync
replication wait in the backends, i.e. delay processing of
QueryCancelPending and ProcDiePending in SyncRepWaitForLSN(). A new
timeout GUC synchronous_replication_naptime_before_cancel is introduced;
when set, it lets the backends wait for the ack before canceling the
synchronous replication wait so that the transaction can be available on
the sync standbys as well. If the ack isn't received even within this
time frame, the backend cancels the wait and goes ahead as it does
today. In production HA environments, the GUC can be set to a reasonable
value to avoid missing transactions during failovers.
2) Wait for sync standbys to catch up upon restart after the crash, or
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend that's making the first
connection wait for the sync standbys to catch up in ClientAuthentication,
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.
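To make (2) a bit more concrete, here's a rough sketch of what such a
wait could look like (purely illustrative: the function name and the
shared-memory flag are made up, and I'm only assuming that
SyncRepWaitForLSN()/GetFlushRecPtr() are usable from that point):

/*
 * Hypothetical sketch: after successful authentication, make the backend
 * wait until the sync standbys have acknowledged everything that was
 * flushed locally before the restart.
 */
static void
WaitForSyncStandbysToCatchUp(void)
{
	/* Hypothetical shared-memory flag set at the end of crash recovery. */
	if (!SyncRepCatchupPending())
		return;

	/* Wait for the ack of everything flushed so far; not a commit wait. */
	SyncRepWaitForLSN(GetFlushRecPtr(NULL), false);

	SyncRepClearCatchupPending();	/* hypothetical */
}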
Thoughts?
Here's a WIP patch implementing (1); I'm yet to code (2). I haven't
added tests yet - I'm yet to figure out how to add one, as there's no
way we can delay the WAL sender so that we can reliably hit this code.
I will think more about this.
[1]: /messages/by-id/CAHg+QDdTdPsqtu0QLG8rMg3Xo=6Xo23TwHPYsUgGNEK13wTY5g@mail.gmail.com
Regards,
Bharath Rupireddy.
Attachments:
v1-0001-Wait-specified-amount-of-time-before-cancelling-s.patch (application/x-patch)
From d5fe07bbd80b72dfbf06e9b039b9e4a93a7f7a06 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sun, 24 Apr 2022 03:42:59 +0000
Subject: [PATCH v1] Wait specified amount of time before cancelling sync
replication
In PostgreSQL high availability setup with synchronous replication,
typically all the transactions first locally get committed, then
streamed to the synchronous standbys and the backends that generated
the transaction will wait for acknowledgement from synchronous
standbys. While waiting for acknowledgement, it may happen that the
query or the transaction gets canceled or the backend that's waiting
for acknowledgement is asked to exit. In either of these cases, the
wait for acknowledgement gets canceled and leaves the transaction in an
inconsistent state as it got committed locally but not on the
standbys. When the GUC synchronous_replication_naptime_before_cancel
introduced in this patch is set, it will let the backends wait for the
acknowledgement before canceling the wait so that the transaction can
be available on the synchronous standbys as well.
---
doc/src/sgml/config.sgml | 30 +++++++++++
src/backend/replication/syncrep.c | 50 +++++++++++++++++++
src/backend/utils/misc/guc.c | 12 +++++
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/include/replication/syncrep.h | 3 ++
5 files changed, 97 insertions(+)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 03986946a8..1681ea173f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4507,6 +4507,36 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-replication-naptime-before-cancel" xreflabel="synchronous_replication_naptime_before_cancel">
+ <term><varname>synchronous_replication_naptime_before_cancel</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_replication_naptime_before_cancel</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of time in milliseconds to wait for synchronous
+ replication before cancelling the wait. The default is 0; a value of
+ -1 or 0 disables this feature. In <productname>PostgreSQL</productname> high
+ availability setup with synchronous replication, typically all the
+ transactions first locally get committed, then streamed to the
+ synchronous standbys and the backends that generated the transaction
+ will wait for acknowledgement from synchronous standbys. While waiting
+ for acknowledgement, it may happen that the query or the transaction
+ gets canceled or the backend that's waiting for acknowledgement is
+ asked to exit. In either of these cases, the wait for acknowledgement
+ gets canceled and leaves the transaction in an inconsistent state as
+ it got committed locally but not on the standbys. When the
+ <varname>synchronous_replication_naptime_before_cancel</varname>
+ parameter is set, it will let the backends wait for the acknowledgement
+ before canceling the wait so that the transaction can be available on
+ the synchronous standbys as well. This parameter can
+ only be set in the <filename>postgresql.conf</filename> file or on the
+ server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ce163b99e9..0f54d81f2b 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -88,6 +88,7 @@
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
+int SyncRepNapTimeBeforeCancel = 0;
#define SyncStandbysDefined() \
(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
@@ -119,6 +120,7 @@ static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
static int SyncRepGetStandbyPriority(void);
static int standby_priority_comparator(const void *a, const void *b);
static int cmp_lsn(const void *a, const void *b);
+static bool SyncRepNapBeforeCancel(void);
#ifdef USE_ASSERT_CHECKING
static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -130,6 +132,42 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
* ===========================================================
*/
+/*
+ * Wait for synchronous replication before cancelling, if requested by user.
+ */
+static bool
+SyncRepNapBeforeCancel(void)
+{
+ int wait_time;
+
+ if (SyncRepNapTimeBeforeCancel <= 0)
+ return false;
+
+ ereport(WARNING,
+ (errmsg_plural("waiting %d millisecond for synchronous replication before cancelling",
+ "waiting %d milliseconds for synchronous replication before cancelling",
+ SyncRepNapTimeBeforeCancel,
+ SyncRepNapTimeBeforeCancel)));
+
+ wait_time = SyncRepNapTimeBeforeCancel;
+
+ while (wait_time > 0)
+ {
+ /*
+ * Wait in intervals of 1 millisecond so that we can frequently check
+ * for the acknowledgement.
+ */
+ pg_usleep(1 * 1000L);
+
+ wait_time--;
+
+ if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+ return true;
+ }
+
+ return true;
+}
+
/*
* Wait for synchronous replication, if requested by user.
*
@@ -263,6 +301,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
*/
if (ProcDiePending)
{
+ if (SyncRepNapBeforeCancel())
+ {
+ if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+ break;
+ }
+
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
@@ -280,6 +324,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
*/
if (QueryCancelPending)
{
+ if (SyncRepNapBeforeCancel())
+ {
+ if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+ break;
+ }
+
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8e9b71375c..547bc2727f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2743,6 +2743,18 @@ static struct config_int ConfigureNamesInt[] =
0, 0, 1000000, /* see ComputeXidHorizons */
NULL, NULL, NULL
},
+
+ {
+ {"synchronous_replication_naptime_before_cancel", PGC_SIGHUP, REPLICATION_PRIMARY,
+ gettext_noop("Sets the amount of time to wait for synchronous replictaion before cancelling."),
+ gettext_noop("A value of -1 or 0 disables this feature."),
+ GUC_UNIT_MS
+ },
+ &SyncRepNapTimeBeforeCancel,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"vacuum_failsafe_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop("Age at which VACUUM should trigger failsafe to avoid a wraparound outage."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 94270eb0ec..4fd4d04804 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -324,6 +324,8 @@
# and comma-separated list of application_name
# from standby(s); '*' = all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
+#synchronous_replication_naptime_before_cancel = 0 # amount of time to wait for
+ # synchronous replication before cancelling; 0 or -1 disables
# - Standby Servers -
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 4d7c90b9f0..6678f14b93 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -81,6 +81,9 @@ extern PGDLLIMPORT char *syncrep_parse_error_msg;
/* user-settable parameters for synchronous replication */
extern PGDLLIMPORT char *SyncRepStandbyNames;
+/* user-settable nap time for synchronous replication before cancelling */
+extern PGDLLIMPORT int SyncRepNapTimeBeforeCancel;
+
/* called by user backend */
extern void SyncRepWaitForLSN(XLogRecPtr lsn, bool commit);
--
2.25.1
On Mon, Apr 25, 2022 at 07:51:03PM +0530, Bharath Rupireddy wrote:
With synchronous replication typically all the transactions (txns)
first locally get committed, then streamed to the sync standbys and
the backend that generated the transaction will wait for ack from sync
standbys. While waiting for ack, it may happen that the query or the
txn gets canceled (QueryCancelPending is true) or the waiting backend
is asked to exit (ProcDiePending is true). In either of these cases,
the wait for ack gets canceled and leaves the txn in an inconsistent
state (as in the client thinks that the txn would have replicated to
sync standbys) - "The transaction has already committed locally, but
might not have been replicated to the standby.". Upon restart after
the crash or in the next txn after the old locally committed txn was
canceled, the client will be able to see the txns that weren't
actually streamed to sync standbys. Also, if the client fails over to
one of the sync standbys after the crash (either by choice or because
of automatic failover management after crash), the locally committed
txns on the crashed primary would be lost which isn't good in a true
HA solution.
This topic has come up a few times recently [0] [1] [2].
Thoughts?
I think this will require a fair amount of discussion. I'm personally in
favor of just adding a GUC that can be enabled to block canceling
synchronous replication waits, but I know folks have concerns with that
approach. When I looked at this stuff previously [2], it seemed possible
to handle the other data loss scenarios (e.g., forcing failover whenever
the primary shut down, turning off restart_after_crash). However, I'm not
wedded to this approach.
[0]: /messages/by-id/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C@yandex-team.ru
[1]: /messages/by-id/cac4b9df-92c6-77aa-687b-18b86cb13728@stratox.cz
[2]: /messages/by-id/FDE157D7-3F35-450D-B927-7EC2F82DB1D6@amazon.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:
With synchronous replication typically all the transactions (txns)
first locally get committed, then streamed to the sync standbys and
the backend that generated the transaction will wait for ack from sync
standbys. While waiting for ack, it may happen that the query or the
txn gets canceled (QueryCancelPending is true) or the waiting backend
is asked to exit (ProcDiePending is true). In either of these cases,
the wait for ack gets canceled and leaves the txn in an inconsistent
state [...]
Here's a proposal (mentioned previously by Satya [1]) to avoid the
above problems:
1) Wait a configurable amount of time before canceling the sync
replication by the backends i.e. delay processing of
QueryCancelPending and ProcDiePending in Introduced a new timeout GUC
synchronous_replication_naptime_before_cancel, when set, it will let
the backends wait for the ack before canceling the synchronous
replication so that the transaction can be available in sync standbys
as well.
2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled.
While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.
Is it worth adding additional complexity that is not a complete solution?
Yours,
Laurenz Albe
On 25 Apr 2022, at 21:48, Nathan Bossart <nathandbossart@gmail.com> wrote:
I'm personally in
favor of just adding a GUC that can be enabled to block canceling
synchronous replication waits
+1. I think it's the only option to provide quorum commit guarantees.
Best regards, Andrey Borodin.
On Tue, Apr 26, 2022 at 11:57 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:
With synchronous replication typically all the transactions (txns)
first locally get committed, then streamed to the sync standbys and
the backend that generated the transaction will wait for ack from sync
standbys. While waiting for ack, it may happen that the query or the
txn gets canceled (QueryCancelPending is true) or the waiting backend
is asked to exit (ProcDiePending is true). In either of these cases,
the wait for ack gets canceled and leaves the txn in an inconsistent
state [...]
Here's a proposal (mentioned previously by Satya [1]) to avoid the
above problems:
1) Wait a configurable amount of time before canceling the sync
replication by the backends i.e. delay processing of
QueryCancelPending and ProcDiePending in Introduced a new timeout GUC
synchronous_replication_naptime_before_cancel, when set, it will let
the backends wait for the ack before canceling the synchronous
replication so that the transaction can be available in sync standbys
as well.
2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled.
While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.
I'm not sure it's recommended to use 2PC in a postgres HA setup with
sync replication, where the documentation [1] says that "PREPARE
TRANSACTION" and the other 2PC commands are "intended for use by
external transaction management systems" and are used with explicit
transactions. Whereas the txns within a postgres HA setup with sync
replication don't always have to be explicit txns. Am I missing
something here?
Is it worth adding additional complexity that is not a complete solution?
The proposed approach helps to avoid some common possible problems
that arise with simple scenarios (like cancelling a long running query
while in SyncRepWaitForLSN) within sync replication.
[1]: https://www.postgresql.org/docs/devel/sql-prepare-transaction.html
Regards,
Bharath Rupireddy.
On Mon, May 9, 2022 at 2:50 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Tue, Apr 26, 2022 at 11:57 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:
With synchronous replication typically all the transactions (txns)
first locally get committed, then streamed to the sync standbys and
the backend that generated the transaction will wait for ack from sync
standbys. While waiting for ack, it may happen that the query or the
txn gets canceled (QueryCancelPending is true) or the waiting backend
is asked to exit (ProcDiePending is true). In either of these cases,
the wait for ack gets canceled and leaves the txn in an inconsistent
state [...]
Here's a proposal (mentioned previously by Satya [1]) to avoid the
above problems:
1) Wait a configurable amount of time before canceling the sync
replication by the backends i.e. delay processing of
QueryCancelPending and ProcDiePending in Introduced a new timeout GUC
synchronous_replication_naptime_before_cancel, when set, it will let
the backends wait for the ack before canceling the synchronous
replication so that the transaction can be available in sync standbys
as well.
2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled.
While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.
Not sure if it's recommended to use 2PC in postgres HA with sync
replication where the documentation says that "PREPARE TRANSACTION"
and other 2PC commands are "intended for use by external transaction
management systems" and with explicit transactions. Whereas, the txns
within a postgres HA with sync replication always don't have to be
explicit txns. Am I missing something here?
Is it worth adding additional complexity that is not a complete solution?
The proposed approach helps to avoid some common possible problems
that arise with simple scenarios (like cancelling a long running query
while in SyncRepWaitForLSN) within sync replication.
IMHO, making it wait for some amount of time based on a GUC is not a
complete solution. It is just a hack to avoid the problem in some
cases.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 9 May 2022, at 14:20, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Tue, Apr 26, 2022 at 11:57 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.
Not sure if it's recommended to use 2PC in postgres HA with sync
replication where the documentation says that "PREPARE TRANSACTION"
and other 2PC commands are "intended for use by external transaction
management systems" and with explicit transactions. Whereas, the txns
within a postgres HA with sync replication always don't have to be
explicit txns. Am I missing something here?
COMMIT PREPARED needs to be replicated as well, thus encountering the very same problem as a usual COMMIT: if done during a failover it can be canceled, and committed data can be wrongfully reported as durably written. 2PC is not a remedy for the fact that PG silently cancels the wait for sync replication. The problem arises in the presence of any "commit". And a "commit" is there if transactions are there.
On 9 May 2022, at 14:44, Dilip Kumar <dilipbalaut@gmail.com> wrote:
IMHO, making it wait for some amount of time, based on GUC is not a
complete solution. It is just a hack to avoid the problem in some
cases.
Disallowing cancelation of locally committed transactions is not a hack. It's the removal of a hack that was erroneously installed to make the backend responsive to Ctrl+C (or a client-side statement timeout).
On 26 Apr 2022, at 11:26, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
Is it worth adding additional complexity that is not a complete solution?
It's not additional complexity. It is removing the additional complexity that made sync rep interruptible. (But I'm certainly not talking about GUCs like synchronous_replication_naptime_before_cancel - the wait for sync rep must be indefinite until synchronous_commit/synchronous_standby_names are satisfied.)
And yes, we need additional complexity - but in some other place. A transaction can also be locally committed in the presence of a server crash. But this is another difficult problem: a crashed server must not allow data queries until the LSN of the timeline end is successfully replicated to synchronous_standby_names.
Best regards, Andrey Borodin.
On Mon, May 9, 2022 at 4:39 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 9 May 2022, at 14:44, Dilip Kumar <dilipbalaut@gmail.com> wrote:
IMHO, making it wait for some amount of time, based on GUC is not a
complete solution. It is just a hack to avoid the problem in some
cases.
Disallowing cancelation of locally committed transactions is not a hack. It's removing of a hack that was erroneously installed to make backend responsible to Ctrl+C (or client side statement timeout).
I might be missing something, but based on my understanding the
approach is not disallowing query cancellation; it is just adding
configuration for how long to delay before canceling the query. That's
the reason I mentioned that this is not a guaranteed solution. I mean,
even with this configuration value you cannot avoid the problems in all
cases, right?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, May 10, 2022 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, May 9, 2022 at 4:39 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 9 May 2022, at 14:44, Dilip Kumar <dilipbalaut@gmail.com> wrote:
IMHO, making it wait for some amount of time, based on GUC is not a
complete solution. It is just a hack to avoid the problem in some
cases.
Disallowing cancelation of locally committed transactions is not a hack. It's removing of a hack that was erroneously installed to make backend responsible to Ctrl+C (or client side statement timeout).
I might be missing something but based on my understanding the
approach is not disallowing the query cancellation but it is just
adding the configuration for how much to delay before canceling the
query. That's the reason I mentioned that this is not a guarenteed
solution. I mean with this configuration value also you can not avoid
problems in all the cases, right?
Yes Dilip, the proposed GUC in the v1 patch doesn't allow waiting
forever for the sync repl ack; in other words, it doesn't allow blocking
the pending query cancels or proc die interrupts forever. Otherwise the
backends might linger in case the repl ack isn't received or the sync
replicas aren't reachable. Users may have to set the GUC to a
'reasonable value'.
If okay, I can make the GUC behave this way: value 0 - existing
behaviour, i.e. no wait for the sync repl ack, just process query
cancels and proc die interrupts immediately; value -1 - wait unboundedly
for the ack; value > 0 - wait the specified number of milliseconds for
the ack.
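To make those semantics concrete, here's a rough sketch of how the nap
function could branch on the three classes of values (names follow the
v1 patch; the -1 handling is only my reading of the proposal, not
committed code):

static bool
SyncRepNapBeforeCancel(void)
{
	int			wait_time = SyncRepNapTimeBeforeCancel;

	/* 0: today's behaviour, process the interrupt immediately. */
	if (wait_time == 0)
		return false;

	for (;;)
	{
		/* Ack received; the caller can stop waiting. */
		if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
			return true;

		/* A positive value counts down; -1 never does, i.e. waits forever. */
		if (wait_time > 0 && --wait_time == 0)
			return false;		/* timed out, fall back to cancelling */

		pg_usleep(1000L);		/* re-check roughly every millisecond */
	}
}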
Regards,
Bharath Rupireddy.
On 10 May 2022, at 12:59, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
If okay, I can make the GUC behave this way - value 0 existing
behaviour i.e. no wait for sync repl ack, just process query cancels
and proc die interrupts immediately; value -1 wait unboundedly for the
ack; value > 0 wait for specified milliseconds for the ack.
+1 if we make -1 and 0 the only valid values.
query cancels or proc die interrupts
Please note that a typical HA tool would need to handle query cancels and proc die interrupts differently.
When the network is partitioned and a standby is promoted somewhere, you definitely want an infinite wait for cancels. Yet at some point you want to shut down postgres without a coredump - thus proc die needs to be processed.
Thanks!
Best regards, Andrey Borodin.
On Tue, May 10, 2022 at 5:55 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 10 May 2022, at 12:59, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
If okay, I can make the GUC behave this way - value 0 existing
behaviour i.e. no wait for sync repl ack, just process query cancels
and proc die interrupts immediately; value -1 wait unboundedly for the
ack; value > 0 wait for specified milliseconds for the ack.
+1 if we make -1 and 0 only valid values.
query cancels or proc die interrupts
Please note, that typical HA tool would need to handle query cancels and proc die interrupts differently.
Agree.
When the network is partitioned and somewhere standby is promoted you definitely want infinite wait for cancels.
When a standby is promoted, there's no point in the old primary waiting
forever for the ack, assuming we are going to discard it.
Yet once upon a time you want to shutdown postgres without coredump - thus proc die needs to be processed.
I think users can still have the flexibility to set the right amount
of time to process cancel and proc die interrupts.
IMHO, the GUC can still allow 0, -1 and values > 0 in milliseconds; let
the users decide what they want. Do you see any problems with this?
Thoughts?
Regards,
Bharath Rupireddy.
On Tue, May 10, 2022 at 5:55 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 10 May 2022, at 12:59, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
If okay, I can make the GUC behave this way - value 0 existing
behaviour i.e. no wait for sync repl ack, just process query cancels
and proc die interrupts immediately; value -1 wait unboundedly for the
ack; value > 0 wait for specified milliseconds for the ack.
+1 if we make -1 and 0 only valid values.
query cancels or proc die interrupts
Please note, that typical HA tool would need to handle query cancels and proc die interrupts differently.
Hm, after thinking for a while, I tend to agree with the above
approach - meaning, query cancel interrupt processing can be disabled
completely in SyncRepWaitForLSN() while proc die interrupts are
processed immediately; this approach requires no GUC, as opposed to the
proposed v1 patch upthread.
However, it's good to see what other hackers think about this.
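In code terms, the idea would look roughly like this inside the wait
loop of SyncRepWaitForLSN() (a sketch of the intent only, not a patch):

for (;;)
{
	/* ... existing syncRepState checks and postmaster-death handling ... */

	/*
	 * Proc die is still honoured immediately: emit the usual WARNING about
	 * the locally committed but possibly unreplicated transaction and exit.
	 */
	if (ProcDiePending)
		break;

	/*
	 * Query cancel is deliberately not honoured while waiting for the ack;
	 * just tell the user why Ctrl+C appears to do nothing.
	 */
	if (QueryCancelPending)
	{
		QueryCancelPending = false;
		ereport(WARNING,
				(errmsg("ignoring cancel request while waiting for synchronous replication")));
	}

	WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
			  WAIT_EVENT_SYNC_REP);
	ResetLatch(MyLatch);
}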
When the network is partitioned and somewhere standby is promoted you definitely want infinite wait for cancels. Yet once upon a time you want to shutdown postgres without coredump - thus proc die needs to be processed.
Agree.
On Mon, May 9, 2022 at 4:39 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 26 Apr 2022, at 11:26, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
Is it worth adding additional complexity that is not a complete solution?
Its not additional complexity. It is removing additional complexity that made sync rep interruptible. (But I'm surely talking not about GUCs like synchronous_replication_naptime_before_cancel - wait of sync rep must be indefinite until synchrous_commit\synchronous_standby_names are satisfied )
And yes, we need additional complexity - but in some other place. Transaction can also be locally committed in presence of a server crash. But this another difficult problem. Crashed server must not allow data queries until LSN of timeline end is successfully replicated to synchronous_standby_names.
Hm, that needs to be done anyways. How about doing as proposed
initially upthread [1]? Also, quoting the idea here [2].
Thoughts?
[1]: /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2]: 2) Wait for sync standbys to catch up upon restart after the crash or in the next txn after the old locally committed txn was canceled. One way to achieve this is to let the backend, that's making the first connection, wait for sync standbys to catch up in ClientAuthentication right after successful authentication. However, I'm not sure this is the best way to do it at this point.
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend, that's making the first
connection, wait for sync standbys to catch up in ClientAuthentication
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.
Regards,
Bharath Rupireddy.
On 25 Jul 2022, at 14:29, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
Hm, after thinking for a while, I tend to agree with the above
approach - meaning, query cancel interrupt processing can completely
be disabled in SyncRepWaitForLSN() and process proc die interrupt
immediately, this approach requires no GUC as opposed to the proposed
v1 patch upthread.
GUC was proposed here [0] (https://commitfest.postgresql.org/34/2402/) to maintain compatibility with previous behaviour. But I think that having no GUC here is fine too. If we do not allow cancelation of unreplicated backends, of course.
And yes, we need additional complexity - but in some other place. Transaction can also be locally committed in presence of a server crash. But this another difficult problem. Crashed server must not allow data queries until LSN of timeline end is successfully replicated to synchronous_standby_names.
Hm, that needs to be done anyways. How about doing as proposed
initially upthread [1]? Also, quoting the idea here [2].
Thoughts?
[1] /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2] 2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend, that's making the first
connection, wait for sync standbys to catch up in ClientAuthentication
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.
I think ideally the startup process should not allow read-only connections in CheckRecoveryConsistency() until the WAL is replicated to the quorum, at least up to the new timeline LSN.
Thanks!
On Mon, Jul 25, 2022 at 4:20 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 25 Jul 2022, at 14:29, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
Hm, after thinking for a while, I tend to agree with the above
approach - meaning, query cancel interrupt processing can completely
be disabled in SyncRepWaitForLSN() and process proc die interrupt
immediately, this approach requires no GUC as opposed to the proposed
v1 patch upthread.
GUC was proposed here[0] to maintain compatibility with previous behaviour. But I think that having no GUC here is fine too. If we do not allow cancelation of unreplicated backends, of course.
And yes, we need additional complexity - but in some other place. Transaction can also be locally committed in presence of a server crash. But this another difficult problem. Crashed server must not allow data queries until LSN of timeline end is successfully replicated to synchronous_standby_names.
Hm, that needs to be done anyways. How about doing as proposed
initially upthread [1]? Also, quoting the idea here [2].
Thoughts?
[1] /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2] 2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend, that's making the first
connection, wait for sync standbys to catch up in ClientAuthentication
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.
I think ideally startup process should not allow read only connections in CheckRecoveryConsistency() until WAL is not replicated to quorum al least up until new timeline LSN.
We can't do it in CheckRecoveryConsistency() unless I'm missing
something, because the walsenders (required for sending the remaining
WAL to sync standbys to achieve quorum) can only be started after the
server reaches a consistent state - after all, walsenders are
specialized backends.
--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Thu, Aug 4, 2022 at 1:42 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, Jul 25, 2022 at 4:20 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 25 Jul 2022, at 14:29, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
Hm, after thinking for a while, I tend to agree with the above
approach - meaning, query cancel interrupt processing can completely
be disabled in SyncRepWaitForLSN() and process proc die interrupt
immediately, this approach requires no GUC as opposed to the proposed
v1 patch upthread.
GUC was proposed here[0] to maintain compatibility with previous behaviour. But I think that having no GUC here is fine too. If we do not allow cancelation of unreplicated backends, of course.
And yes, we need additional complexity - but in some other place. Transaction can also be locally committed in presence of a server crash. But this another difficult problem. Crashed server must not allow data queries until LSN of timeline end is successfully replicated to synchronous_standby_names.
Hm, that needs to be done anyways. How about doing as proposed
initially upthread [1]? Also, quoting the idea here [2].
Thoughts?
[1] /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2] 2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend, that's making the first
connection, wait for sync standbys to catch up in ClientAuthentication
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.
I think ideally startup process should not allow read only connections in CheckRecoveryConsistency() until WAL is not replicated to quorum al least up until new timeline LSN.
We can't do it in CheckRecoveryConsistency() unless I'm missing
something. Because, the walsenders (required for sending the remaining
WAL to sync standbys to achieve quorum) can only be started after the
server reaches a consistent state, after all walsenders are
specialized backends.
Continuing the above thought (I inadvertently clicked the send button
previously): a simple approach would be to check for the quorum in
PostgresMain(), before entering the query loop for (;;), for the
non-walsender cases. A disadvantage of this would be that, in the worst
case, all the backends will be waiting here if it takes time to achieve
the sync quorum after restart. Roughly, we can do the following in
PostgresMain(); of course, we need a locking mechanism so that all the
backends that reach here wait for the same LSN:
/* pseudocode: sync_replication_defined and wait_for_sync_repl_upon_restart are placeholders */
if (sync_replication_defined == true &&
    shmem->wait_for_sync_repl_upon_restart == true)
{
    /* wait until the sync standbys acknowledge everything flushed so far */
    SyncRepWaitForLSN(pg_current_wal_flush_lsn(), false);
    shmem->wait_for_sync_repl_upon_restart = false;
}
Thoughts?
--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe <laurenz.albe@cybertec.at> wrote in
While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.
Is it worth adding additional complexity that is not a complete solution?
I would agree with this. As with 2PC, whatever we do to make it
perfect, we're left with unresolvable problems, at least for now.
Doesn't it meet your requirements if we have a means to know whether the
last transaction on the current session was locally committed or
aborted?
We are already internally managing the last committed LSN. I think we
can do the same for transaction abort and the last inserted LSN, and we
can expose them in some way. This is way simpler than the (maybe)
uncompletable attempt to fill up the deep gap.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Aug 5, 2022 at 8:19 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe <laurenz.albe@cybertec.at> wrote in
While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.
Is it worth adding additional complexity that is not a complete solution?
I would agree to this. Likewise 2PC, whatever we do to make it
perfect, we're left with unresolvable problems at least for now.
Doesn't it meet your requirements if we have a means to know the last
transaction on the current session is locally committed or aborted?
We are already internally managing last committed LSN. I think we can
do the same thing about transaction abort and last inserted LSN and we
can expose them any way. This is way simpler than the (maybe)
uncompletable attempt to fill up the deep gap.
There can be more txns that are
locally-committed-but-not-yet-replicated. Even if we have that
information stored somewhere, what do we do with it? Those txns are
committed from the client perspective but not committed from the
server's perspective.
Can you please explain more about your idea, I may be missing something?
--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
At Mon, 8 Aug 2022 19:13:25 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
On Fri, Aug 5, 2022 at 8:19 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe <laurenz.albe@cybertec.at> wrote in
While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.
Is it worth adding additional complexity that is not a complete solution?
I would agree to this. Likewise 2PC, whatever we do to make it
perfect, we're left with unresolvable problems at least for now.
Doesn't it meet your requirements if we have a means to know the last
transaction on the current session is locally committed or aborted?
We are already internally managing last committed LSN. I think we can
do the same thing about transaction abort and last inserted LSN and we
can expose them any way. This is way simpler than the (maybe)
uncompletable attempt to fill up the deep gap.
There can be more txns that are
locally-committed-but-not-yet-replicated. Even if we have that
information stored somewhere, what do we do with it? Those txns are
committed from the client perspective but not committed from the
server's perspective.
Can you please explain more about your idea, I may be missing something?
(I'm not sure I understand the requirements here..)
I understand that this is about query cancellation. In the case of a
primary crash/termination, the client cannot even know whether the
commit of the ongoing transaction, if any, has been recorded. Anyway,
there is no way other than to somehow confirm, after restart, that the
change made by the transaction has actually been applied. I believe
that is the standard practice for applications that work on HA clusters.
The same is true in the case of query cancellation, since the commit
response doesn't reach the client either. But even in this case, if we
had functions/views that tell us the
last-committed/last-aborted/last-inserted LSN of a session, we could
know whether the last transaction has been committed, along with the
commit LSN, maybe more easily.
# In fact, I see those functions rather as a means to know whether a
# change by the last transaction on a session is available on some
# replica.
For example, the heavily simplified pseudo code below might show how
the functions (if they were functions) would work.
try {
s.execute("INSERT ..");
c.commit();
} catch (Exception x) {
c.commit();
if (s.execute("SELECT pg_session_last_committed_lsn() = "
"pg_session_last_inserted_lsn()"))
{
/* the transaction has been locally committed */
if (s.execute("SELECT replay_lsn >= pg_session_last_committed_lsn() "
"FROM pg_stat_replication where xxxx")
/* the commit has been replicated to xxx, LSN is known */
} else {
/* the transaction has not been locally committed */
<retry?>
}
}
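For what it's worth, one of those hypothetical functions could probably
be built on top of the existing backend-local XactLastCommitEnd
variable; a minimal sketch (the function name comes from the pseudo code
above, and the PG_FUNCTION_INFO_V1/catalog plumbing is omitted):

/* Hypothetical: end LSN of this session's last locally committed transaction. */
Datum
pg_session_last_committed_lsn(PG_FUNCTION_ARGS)
{
	/* Nothing has been committed in this session yet. */
	if (XLogRecPtrIsInvalid(XactLastCommitEnd))
		PG_RETURN_NULL();

	PG_RETURN_LSN(XactLastCommitEnd);
}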
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Aug 9, 2022 at 12:42 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
Can you please explain more about your idea, I may be missing something?
(I'm not sure I understand the requirements here..)
I've explained the problem with the current HA setup with synchronous
replication upthread at [1]. Let me reiterate it here once again.
When a query is cancelled (a simple stroke of CTRL+C or
pg_cancel_backend() call) while the txn is waiting for ack in
SyncRepWaitForLSN(); for the client, the txn is actually committed
(locally-committed-but-not-yet-replicated to all of sync standbys)
like any other txn, a warning is emitted into server logs but it is of
no use for the client (think of client as applications). There can be
many such txns waiting for ack in SyncRepWaitForLSN() and query cancel
can be issued on all of those sessions. The problem is that the
subsequent reads will then be able to read all of those
locally-committed-but-not-yet-replicated to all of sync standbys txns
data - this is what I call an inconsistency (can we call this a
read-after-write inconsistency?) because of lack of proper query
cancel handling. And if the sync standbys are down or unable to come
up for some reason, until then, the primary will be serving clients
with the inconsistent data. BTW, I found a report of this problem here
[2].
The solution proposed for the above problem is to just 'not honor
query cancels at all while waiting for ack in SyncRepWaitForLSN()'.
When a proc die is pending, then also, there can be
locally-committed-but-not-yet-replicated to all of sync standbys txns.
Typically, there are two choices for the clients 1) reuse the primary
instance after restart 2) failover to one of sync standbys. For case
(1), there might be read-after-write inconsistency as explained above.
For case (2), those txns might get lost completely if the failover
target sync standby or the new primary didn't receive them and the
other sync standbys that have received them are now ahead and need a
special treatment (run pg_rewind) for them to be able to connect to
new primary.
The solution proposed for case (1) of the above problem is to 'process
the ProcDiePending immediately and upon restart the first backend can
wait until the sync standbys are caught up to ensure no inconsistent
reads'.
The solution proposed for case (2) of the above problem is to 'either
run pg_rewind for the sync standbys that are ahead or use the idea
proposed at [3]'.
I hope the above explanation helps.
[1]: /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2]: https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
[3]: /messages/by-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw@mail.gmail.com
--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Tue, Aug 9, 2022 at 2:16 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
I've explained the problem with the current HA setup with synchronous
replication upthread at [1]. Let me reiterate it here once again.
When a query is cancelled (a simple stroke of CTRL+C or
pg_cancel_backend() call) while the txn is waiting for ack in
SyncRepWaitForLSN(); for the client, the txn is actually committed
(locally-committed-but-not-yet-replicated to all of sync standbys)
like any other txn, a warning is emitted into server logs but it is of
no use for the client (think of client as applications). There can be
many such txns waiting for ack in SyncRepWaitForLSN() and query cancel
can be issued on all of those sessions. The problem is that the
subsequent reads will then be able to read all of those
locally-committed-but-not-yet-replicated to all of sync standbys txns
data - this is what I call an inconsistency (can we call this a
read-after-write inconsistency?) because of lack of proper query
cancel handling. And if the sync standbys are down or unable to come
up for some reason, until then, the primary will be serving clients
with the inconsistent data. BTW, I found a report of this problem here
[2].The solution proposed for the above problem is to just 'not honor
query cancels at all while waiting for ack in SyncRepWaitForLSN()'.When a proc die is pending, then also, there can be
locally-committed-but-not-yet-replicated to all of sync standbys txns.
Typically, there are two choices for the clients 1) reuse the primary
instance after restart 2) failover to one of sync standbys. For case
(1), there might be read-after-write inconsistency as explained above.
For case (2), those txns might get lost completely if the failover
target sync standby or the new primary didn't receive them and the
other sync standbys that have received them are now ahead and need a
special treatment (run pg_rewind) for them to be able to connect to
new primary.
The solution proposed for case (1) of the above problem is to 'process
the ProcDiePending immediately and upon restart the first backend can
wait until the sync standbys are caught up to ensure no inconsistent
reads'.
The solution proposed for case (2) of the above problem is to 'either
run pg_rewind for the sync standbys that are ahead or use the idea
proposed at [3]'.
I hope the above explanation helps.
[1] /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2] https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
[3] /messages/by-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw@mail.gmail.com
I'm attaching the v2 patch rebased on the latest HEAD. Please note
that there are still some open points; I'm yet to find time to think
more about them. Meanwhile, I'm posting the v2 patch for making cfbot
happy. Any further thoughts on the overall design of the patch are
most welcome. Thanks.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Wait-specified-amount-of-time-before-cancelling-s.patch (application/x-patch)
From be6734c83d72333cc4ec1b2be1f4d54b875b74c8 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 27 Sep 2022 12:55:34 +0000
Subject: [PATCH v2] Wait specified amount of time before cancelling sync
replication
In PostgreSQL high availability setup with synchronous replication,
typically all the transactions first locally get committed, then
streamed to the synchronous standbys and the backends that generated
the transaction will wait for acknowledgement from synchronous
standbys. While waiting for acknowledgement, it may happen that the
query or the transaction gets canceled or the backend that's waiting
for acknowledgement is asked to exit. In either of these cases, the
wait for acknowledgement gets canceled and leaves the transaction in an
inconsistent state as it got committed locally but not on the
standbys. When the GUC synchronous_replication_naptime_before_cancel
introduced in this patch is set, it will let the backends wait for the
acknowledgement before canceling the wait so that the transaction can
be available on the synchronous standbys as well.
---
doc/src/sgml/config.sgml | 30 +++++++++++
src/backend/replication/syncrep.c | 50 +++++++++++++++++++
src/backend/utils/misc/guc_tables.c | 12 +++++
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/include/replication/syncrep.h | 3 ++
5 files changed, 97 insertions(+)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d8848bc774..baeef49012 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4524,6 +4524,36 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-replication-naptime-before-cancel" xreflabel="synchronous_replication_naptime_before_cancel">
+ <term><varname>synchronous_replication_naptime_before_cancel</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_replication_naptime_before_cancel</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of time in milliseconds to wait for synchronous
+ replication before cancelling the wait. The default is 0; a value of
+ -1 or 0 disables this feature. In <productname>PostgreSQL</productname> high
+ availability setup with synchronous replication, typically all the
+ transactions first locally get committed, then streamed to the
+ synchronous standbys and the backends that generated the transaction
+ will wait for acknowledgement from synchronous standbys. While waiting
+ for acknowledgement, it may happen that the query or the transaction
+ gets canceled or the backend that's waiting for acknowledgement is
+ asked to exit. In either of these cases, the wait for acknowledgement
+ gets canceled and leaves the transaction in an inconsistent state as
+ it got committed locally but not on the standbys. When the
+ <varname>synchronous_replication_naptime_before_cancel</varname>
+ parameter is set, it will let the backends wait for the acknowledgement
+ before canceling the wait so that the transaction can be available on
+ the synchronous standbys as well. This parameter can
+ only be set in the <filename>postgresql.conf</filename> file or on the
+ server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index e360d925b0..60b6a5e471 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -89,6 +89,7 @@
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
+int SyncRepNapTimeBeforeCancel = 0;
#define SyncStandbysDefined() \
(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
@@ -120,6 +121,7 @@ static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
static int SyncRepGetStandbyPriority(void);
static int standby_priority_comparator(const void *a, const void *b);
static int cmp_lsn(const void *a, const void *b);
+static bool SyncRepNapBeforeCancel(void);
#ifdef USE_ASSERT_CHECKING
static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -131,6 +133,42 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
* ===========================================================
*/
+/*
+ * Wait for synchronous replication before cancelling, if requested by user.
+ */
+static bool
+SyncRepNapBeforeCancel(void)
+{
+ int wait_time;
+
+ if (SyncRepNapTimeBeforeCancel <= 0)
+ return false;
+
+ ereport(WARNING,
+ (errmsg_plural("waiting %d millisecond for synchronous replication before cancelling",
+ "waiting %d milliseconds for synchronous replication before cancelling",
+ SyncRepNapTimeBeforeCancel,
+ SyncRepNapTimeBeforeCancel)));
+
+ wait_time = SyncRepNapTimeBeforeCancel;
+
+ while (wait_time > 0)
+ {
+ /*
+ * Wait in intervals of 1 millisecond so that we can frequently check
+ * for the acknowledgement.
+ */
+ pg_usleep(1 * 1000L);
+
+ wait_time--;
+
+ if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+ return true;
+ }
+
+ return true;
+}
+
/*
* Wait for synchronous replication, if requested by user.
*
@@ -264,6 +302,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
*/
if (ProcDiePending)
{
+ if (SyncRepNapBeforeCancel())
+ {
+ if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+ break;
+ }
+
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
@@ -281,6 +325,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
*/
if (QueryCancelPending)
{
+ if (SyncRepNapBeforeCancel())
+ {
+ if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+ break;
+ }
+
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ab3140ff3a..bb14e47c45 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2496,6 +2496,18 @@ struct config_int ConfigureNamesInt[] =
0, 0, 1000000, /* see ComputeXidHorizons */
NULL, NULL, NULL
},
+
+ {
+ {"synchronous_replication_naptime_before_cancel", PGC_SIGHUP, REPLICATION_PRIMARY,
+ gettext_noop("Sets the amount of time to wait for synchronous replictaion before cancelling."),
+ gettext_noop("A value of -1 or 0 disables this feature."),
+ GUC_UNIT_MS
+ },
+ &SyncRepNapTimeBeforeCancel,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"vacuum_failsafe_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop("Age at which VACUUM should trigger failsafe to avoid a wraparound outage."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2ae76e5cfb..7849e3c5b3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -324,6 +324,8 @@
# and comma-separated list of application_name
# from standby(s); '*' = all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
+#synchronous_replication_naptime_before_cancel = 0 # amount of time to wait for
+ # synchronous replication before cancelling; 0 or -1 disables
# - Standby Servers -
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 0f3b3a7955..9b668e9a61 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -80,6 +80,9 @@ extern PGDLLIMPORT char *syncrep_parse_error_msg;
/* user-settable parameters for synchronous replication */
extern PGDLLIMPORT char *SyncRepStandbyNames;
+/* user-settable nap time for synchronous replication before cancelling */
+extern PGDLLIMPORT int SyncRepNapTimeBeforeCancel;
+
/* called by user backend */
extern void SyncRepWaitForLSN(XLogRecPtr lsn, bool commit);
--
2.34.1
On Tue, Sep 27, 2022 at 06:52:21PM +0530, Bharath Rupireddy wrote:
On Tue, Aug 9, 2022 at 2:16 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:I've explained the problem with the current HA setup with synchronous
replication upthread at [1]. Let me reiterate it here once again.
When a query is cancelled (a simple stroke of CTRL+C or
pg_cancel_backend() call) while the txn is waiting for ack in
SyncRepWaitForLSN(); for the client, the txn is actually committed
(locally-committed-but-not-yet-replicated to all of sync standbys)
like any other txn, a warning is emitted into server logs but it is of
no use for the client (think of client as applications). There can be
many such txns waiting for ack in SyncRepWaitForLSN() and query cancel
can be issued on all of those sessions. The problem is that the
subsequent reads will then be able to read all of those
locally-committed-but-not-yet-replicated to all of sync standbys txns
data - this is what I call an inconsistency (can we call this a
read-after-write inconsistency?) because of lack of proper query
cancel handling. And if the sync standbys are down or unable to come
up for some reason, until then, the primary will be serving clients
with the inconsistent data. BTW, I found a report of this problem here
[2].
The solution proposed for the above problem is to just 'not honor
query cancels at all while waiting for ack in SyncRepWaitForLSN()'.
When a proc die is pending, then also, there can be
locally-committed-but-not-yet-replicated to all of sync standbys txns.
Typically, there are two choices for the clients 1) reuse the primary
instance after restart 2) failover to one of sync standbys. For case
(1), there might be read-after-write inconsistency as explained above.
For case (2), those txns might get lost completely if the failover
target sync standby or the new primary didn't receive them and the
other sync standbys that have received them are now ahead and need a
special treatment (run pg_rewind) for them to be able to connect to
new primary.
The solution proposed for case (1) of the above problem is to 'process
the ProcDiePending immediately and upon restart the first backend can
wait until the sync standbys are caught up to ensure no inconsistent
reads'.
The solution proposed for case (2) of the above problem is to 'either
run pg_rewind for the sync standbys that are ahead or use the idea
proposed at [3]'.
I hope the above explanation helps.
[1] /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2] https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
[3] /messages/by-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw@mail.gmail.com
I'm attaching the v2 patch rebased on the latest HEAD. Please note
that there are still some open points, I'm yet to find time to think
more about them. Meanwhile, I'm posting the v2 patch for making cfbot
happy. Any further thoughts on the overall design of the patch are
most welcome. Thanks.
...
In PostgreSQL high availability setup with synchronous replication,
typically all the transactions first locally get committed, then
streamed to the synchronous standbys and the backends that generated
the transaction will wait for acknowledgement from synchronous
standbys. While waiting for acknowledgement, it may happen that the
query or the transaction gets canceled or the backend that's waiting
for acknowledgement is asked to exit. In either of these cases, the
wait for acknowledgement gets canceled and leaves transaction in an
inconsistent state as it got committed locally but not on the
standbys. When set the GUC synchronous_replication_naptime_before_cancel
introduced in this patch, it will let the backends wait for the
acknowledgement before canceling the wait for acknowledgement so
that the transaction can be available in synchronous standbys as
well.
I don't think this patch is going in the right direction, and I think we
need to step back to see why.
First, let's see how synchronous replication works. Each backend waits
for one or more synchronous replicas to acknowledge the WAL related to
its commit and then it marks the commit as done in PGPROC and then to
the client; I wrote a blog about it:
https://momjian.us/main/blogs/pgblog/2020.html#June_3_2020
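To make that concrete, here is a heavily simplified sketch of the wait
loop in SyncRepWaitForLSN() (an illustration only, not the actual
upstream code; queue management, warnings, and PS-display handling are
omitted):

    /*
     * Simplified sketch of the existing wait in SyncRepWaitForLSN().
     * The commit record is already flushed to local WAL before we get
     * here; we only wait for a walsender to mark the wait complete.
     */
    for (;;)
    {
        ResetLatch(MyLatch);

        if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
            break;              /* sync standby(s) acknowledged our LSN */

        /*
         * Today: give up on termination or query cancel, leaving a
         * locally-committed-but-possibly-not-replicated transaction.
         */
        if (ProcDiePending || QueryCancelPending)
        {
            SyncRepCancelWait();
            break;
        }

        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
                         -1, WAIT_EVENT_SYNC_REP);
    }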
So, what happens when an insufficient number of synchronous replicas
reply? Sessions hang because the synchronous behavior cannot be
guaranteed. We then _allow_ query cancel so the user or administrator
can get out of the hung sessions and perhaps modify
synchronous_standby_names.
What the proposed patch effectively does is to prevent the ability to
recover the hung sessions or auto-continue the sessions if an
insufficient number of synchronous replicas respond. So, in a way, it
is allowing for more strict and less strict behavior of
synchronous_standby_names.
However, I think trying to solve this at the session level is the wrong
approach. If you set a timeout to continue stuck sessions, odds are the
timeout will be too long for each session and performance will be
unacceptable anyway, so you haven't gained much. If you prevent
cancels, you effectively lock up the system with fewer options of
recovery.
I have always felt this has to be done at the server level, meaning when
a synchronous_standby_names replica is not responding after a certain
timeout, the administrator must be notified by calling a shell command
defined in a GUC and all sessions will ignore the replica. This gives a
much more predictable and useful behavior than the one in the patch ---
we have discussed this approach many times on the email lists.
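Purely as an illustration of that idea (these GUC names are
hypothetical -- they exist neither in PostgreSQL nor in the posted
patch), such a server-level knob might look like this in
postgresql.conf:

    # Hypothetical settings, shown only to sketch the server-level
    # timeout-plus-notification idea described above.
    #synchronous_standby_unresponsive_timeout = 30s
    #                  # ignore a sync standby that has not acknowledged
    #                  # WAL for this long
    #synchronous_standby_unresponsive_command = '/usr/local/bin/notify-dba.sh %a'
    #                  # shell command run (and waited on) before the standby
    #                  # is ignored; %a = application_name (assumed placeholder)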
Once we have that, we can consider removing the cancel ability while
waiting for synchronous replicas (since we have the timeout) or make it
optional. We can also consider how to notify the administrator during
query cancel (if we allow it), backend abrupt exit/crash, and if we
should allow users to specify a retry interval to resynchronize the
synchronous replicas.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Indecision is a decision. Inaction is an action. Mark Batterson
On Fri, Sep 30, 2022 at 4:23 AM Bruce Momjian <bruce@momjian.us> wrote:
I don't think this patch is going in the right direction, and I think we
need to step back to see why.
Thanks a lot Bruce for providing insights.
First, let's see how synchronous replication works. Each backend waits
for one or more synchronous replicas to acknowledge the WAL related to
its commit and then it marks the commit as done in PGPROC and then to
the client; I wrote a blog about it:
Great!
So, what happens when an insufficient number of synchronous replicas
reply? Sessions hang because the synchronous behavior cannot be
guaranteed. We then _allow_ query cancel so the user or administrator
can get out of the hung sessions and perhaps modify
synchronous_standby_names.
Right.
What the proposed patch effectively does is to prevent the ability to
recover the hung sessions or auto-continue the sessions if an
insufficient number of synchronous replicas respond. So, in a way, it
is allowing for more strict and less strict behavior of
synchronous_standby_names.
Yes. I do agree that it makes the sessions further wait and closes the
quick exit path that admins/users have when the problem occurs. But it
disallows users from cancelling queries or terminating backends only to
end up with locally-committed-but-not-replicated transactions on a
server setup where all the sync standbys are working fine. I agree that we don't
want to further wait on cancels or proc dies as it makes the system
more unresponsive [1].
However, I think trying to solve this at the session level is the wrong
approach. If you set a timeout to continue stuck sessions, odds are the
timeout will be too long for each session and performance will be
unacceptable anyway, so you haven't gained much. If you prevent
cancels, you effectively lock up the system with fewer options of
recovery.
Yes.
I have always felt this has to be done at the server level, meaning when
a synchronous_standby_names replica is not responding after a certain
timeout, the administrator must be notified by calling a shell command
defined in a GUC and all sessions will ignore the replica. This gives a
much more predictable and useful behavior than the one in the patch ---
we have discussed this approach many times on the email lists.
IIUC, each walsender serving a sync standby will determine that the
sync standby isn't responding for a configurable amount of time (less
than wal_sender_timeout) and calls shell command to notify the admin
if there are any backends waiting for sync replication in
SyncRepWaitForLSN(). The shell command then provides the unresponsive
sync standby name at the bare minimum for the admin to ignore it as
sync standby/remove it from synchronous_standby_names to continue
further. This still requires manual intervention which is a problem if
running postgres server instances at scale. Also, having a new shell
command in any form may pose security risks. I'm not sure at this
point how this new timeout is going to work alongside
wal_sender_timeout.
I'm thinking about the possible options that an admin has to get out
of this situation:
1) Removing the standby from synchronous_standby_names.
2) Fixing the sync standby, by restarting or restoring the lost part
(such as network or some other).
(1) is something that postgres can help admins get out of the problem
easily and automatically without any intervention. (2) is something
postgres can't do much about.
How about we let postgres automatically remove an unresponsive (for a
pre-configured time) sync standby from synchronous_standby_names and
inform the user (via log message and via new walsender property and
pg_stat_replication for monitoring purposes)? The users can then
detect such standbys and later try to bring them back to the sync
standbys group or do other things. I believe that a production level
postgres HA with sync standbys will have monitoring to detect the
replication lag, failover decision etc via monitoring
pg_stat_replication. With this approach, a bit more monitoring is
needed. This solution requires less or no manual intervention and
scales well. Please note that I haven't studied the possibilities of
implementing it yet.
Thoughts?
Once we have that, we can consider removing the cancel ability while
waiting for synchronous replicas (since we have the timeout) or make it
optional. We can also consider how to notify the administrator during
query cancel (if we allow it), backend abrupt exit/crash, and
Yeah. If we have the
timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
the users can then choose to disable processing query cancels/proc
dies while waiting for sync replication in SyncRepWaitForLSN().
if we
should allow users to specify a retry interval to resynchronize the
synchronous replicas.
This is another interesting thing to consider if we were to make the
auto-removed (by the above approach) standby a sync standby again
without manual intervention.
Thoughts?
[1]: /messages/by-id/CA+TgmoaCBwgMDkeBDOgtPgHcbfSYq+zORjL5DoU3pJyjALxtoQ@mail.gmail.com
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Sat, Oct 1, 2022 at 06:59:26AM +0530, Bharath Rupireddy wrote:
I have always felt this has to be done at the server level, meaning when
a synchronous_standby_names replica is not responding after a certain
timeout, the administrator must be notified by calling a shell command
defined in a GUC and all sessions will ignore the replica. This gives a
------------------------------------
much more predictable and useful behavior than the one in the patch ---
we have discussed this approach many times on the email lists.
IIUC, each walsender serving a sync standby will determine that the
sync standby isn't responding for a configurable amount of time (less
than wal_sender_timeout) and calls shell command to notify the admin
if there are any backends waiting for sync replication in
SyncRepWaitForLSN(). The shell command then provides the unresponsive
sync standby name at the bare minimum for the admin to ignore it as
sync standby/remove it from synchronous_standby_names to continue
further. This still requires manual intervention which is a problem if
running postgres server instances at scale. Also, having a new shell
As I highlighted above, by default you notify the administrator that a
synchronous replica is not responding and then ignore it. If it becomes
responsive again, you notify the administrator again and add it back as
a synchronous replica.
command in any form may pose security risks. I'm not sure at this
point how this new timeout is going to work alongside
wal_sender_timeout.
We have archive_command, so I don't see a problem with another shell
command.
I'm thinking about the possible options that an admin has to get out
of this situation:
1) Removing the standby from synchronous_standby_names.
Yes, see above. We might need a read-only GUC that reports which
synchronous replicas are active. As you can see, there is a lot of API
design required here, but this is the most effective approach.
2) Fixing the sync standby, by restarting or restoring the lost part
(such as network or some other).
(1) is something that postgres can help admins get out of the problem
easily and automatically without any intervention. (2) is something
postgres can't do much about.
How about we let postgres automatically remove an unresponsive (for a
pre-configured time) sync standby from synchronous_standby_names and
inform the user (via log message and via new walsender property and
pg_stat_replication for monitoring purposes)? The users can then
detect such standbys and later try to bring them back to the sync
standbys group or do other things. I believe that a production level
postgres HA with sync standbys will have monitoring to detect the
replication lag, failover decision etc via monitoring
pg_stat_replication. With this approach, a bit more monitoring is
needed. This solution requires less or no manual intervention and
scales well. Please note that I haven't studied the possibilities of
implementing it yet.
Thoughts?
Yes, see above.
Once we have that, we can consider removing the cancel ability while
waiting for synchronous replicas (since we have the timeout) or make it
optional. We can also consider how do notify the administrator during
query cancel (if we allow it), backend abrupt exit/crash, and
Yeah. If we have the
timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
the users can then choose to disable processing query cancels/proc
dies while waiting for sync replication in SyncRepWaitForLSN().
Yes. We might also change things so a query cancel that happens during
synchronous replica waiting can only be done by an administrator, not the
session owner. Again, lots of design needed here.
if we
should allow users to specify a retry interval to resynchronize the
synchronous replicas.
This is another interesting thing to consider if we were to make the
auto-removed (by the above approach) standby a sync standby again
without manual intervention.
Yes, see above. You are addressing the right questions here. :-)
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Indecision is a decision. Inaction is an action. Mark Batterson
On Thu, Oct 6, 2022 at 2:30 AM Bruce Momjian <bruce@momjian.us> wrote:
As I highlighted above, by default you notify the administrator that a
synchronous replica is not responding and then ignore it. If it becomes
responsive again, you notify the administrator again and add it back as
a synchronous replica.
command in any form may pose security risks. I'm not sure at this
point how this new timeout is going to work alongside
wal_sender_timeout.
We have archive_command, so I don't see a problem with another shell
command.
Why do we need a new command to inform the admin/user about a sync
replication being ignored (from sync quorum) for not responding or
acknowledging for a certain amount of time in SyncRepWaitForLSN()?
Can't we just add an extra column or use existing sync_state in
pg_stat_replication()? We can either introduce a new state such as
temporary_async or just use the existing state 'potential' [1]. A
problem is that the server has to be monitored for this extra, new
state. If we do this, we don't need another command to report.
I'm thinking about the possible options that an admin has to get out
of this situation:
1) Removing the standby from synchronous_standby_names.
Yes, see above. We might need a read-only GUC that reports which
synchronous replicas are active. As you can see, there is a lot of API
design required here, but this is the most effective approach.
If we use the above approach to report via pg_stat_replication(), we
don't need this.
Once we have that, we can consider removing the cancel ability while
waiting for synchronous replicas (since we have the timeout) or make it
optional. We can also consider how to notify the administrator during
query cancel (if we allow it), backend abrupt exit/crash, and
Yeah. If we have the
timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
the users can then choose to disable processing query cancels/proc
dies while waiting for sync replication in SyncRepWaitForLSN().
Yes. We might also change things so a query cancel that happens during
synchronous replica waiting can only be done by an administrator, not the
session owner. Again, lots of design needed here.
Yes, we need infrastructure to track who issued the query cancel or
proc die and so on. IMO, it's not a good way to allow/disallow query
cancels or CTRL+C based on role types - superusers or users with
replication roles or users who are members of any of predefined roles.
In general, it is the walsender serving sync standby that has to mark
itself as async standby by removing itself from
synchronous_standby_names, reloading config variables and waking up
the backends that are waiting in syncrep wait queue for it to update
LSN.
And, the new auto removal timeout should always be set to less than
wal_sender_timeout.
All that said, imagine we have
timeout-and-auto-removal-of-standby-from-sync-standbys-list solution
in one or the other forms with auto removal timeout set to 5 minutes,
any of following can happen:
1) query is stuck waiting for sync standby ack in SyncRepWaitForLSN(),
no query cancel or proc die interrupt is arrived, the sync standby is
made an async standby after the timeout i.e. 5 minutes.
2) query is stuck waiting for sync standby ack in SyncRepWaitForLSN(),
say for about 3 minutes, then query cancel or proc die interrupt is
arrived, should we immediately process it or wait for timeout to
happen (2 more minutes) and then process the interrupt? If we
immediately process the interrupts, then the
locally-committed-but-not-replicated-to-sync-standby problems
described upthread [2] are left unresolved.
[1]: https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-REPLICATION-VIEW
sync_state text
Synchronous state of this standby server. Possible values are:
async: This standby server is asynchronous.
potential: This standby server is now asynchronous, but can
potentially become synchronous if one of current synchronous ones
fails.
sync: This standby server is synchronous.
quorum: This standby server is considered as a candidate for quorum standbys.
[2]: /messages/by-id/CALj2ACXmMWtpmuT-=v8F+Lk4QCbdkeN+yHKXeRGKFfjG96YbKA@mail.gmail.com
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Oct 6, 2022 at 01:33:33PM +0530, Bharath Rupireddy wrote:
On Thu, Oct 6, 2022 at 2:30 AM Bruce Momjian <bruce@momjian.us> wrote:
As I highlighted above, by default you notify the administrator that a
synchronous replica is not responding and then ignore it. If it becomes
responsive again, you notify the administrator again and add it back as
a synchronous replica.
command in any form may pose security risks. I'm not sure at this
point how this new timeout is going to work alongside
wal_sender_timeout.
We have archive_command, so I don't see a problem with another shell
command.
Why do we need a new command to inform the admin/user about a sync
replication being ignored (from sync quorum) for not responding or
acknowledging for a certain amount of time in SyncRepWaitForLSN()?
Can't we just add an extra column or use existing sync_state in
pg_stat_replication()? We can either introduce a new state such as
temporary_async or just use the existing state 'potential' [1]. A
problem is that the server has to be monitored for this extra, new
state. If we do this, we don't need another command to report.
Yes, that is a good point. I assumed people would want notification
immediately rather than waiting for monitoring to notice it. Consider
if you monitor every five seconds but the primary loses sync and goes
down during that five-second interval --- there would be no way to know
if sync stopped and reported committed transactions to the client before
the primary went down. I would love to just rely on monitoring but I am
not sure that is sufficient for this use-case.
Of course, if email is being sent it might be still in the email queue
when the primary goes down, but I guess if I was doing it I would make
sure the email was delivered _before_ returning. The point is that we
would not disable the sync and acknowledge the commit to the client
until the notification command returns success --- that kind of
guarantee is hard to do with monitoring.
These are good discussions to have --- maybe I am wrong.
Once we have that, we can consider removing the cancel ability while
waiting for synchronous replicas (since we have the timeout) or make it
optional. We can also consider how to notify the administrator during
query cancel (if we allow it), backend abrupt exit/crash, and
Yeah. If we have the
timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
the users can then choose to disable processing query cancels/proc
dies while waiting for sync replication in SyncRepWaitForLSN().
Yes. We might also change things so a query cancel that happens during
synchronous replica waiting can only be done by an administrator, not the
session owner. Again, lots of design needed here.
Yes, we need infrastructure to track who issued the query cancel or
proc die and so on. IMO, it's not a good way to allow/disallow query
cancels or CTRL+C based on role types - superusers or users with
replication roles or users who are members of any of predefined roles.
In general, it is the walsender serving sync standby that has to mark
itself as async standby by removing itself from
synchronous_standby_names, reloading config variables and waking up
the backends that are waiting in syncrep wait queue for it to update
LSN.
And, the new auto removal timeout should always be set to less than
wal_sender_timeout.
All that said, imagine we have
timeout-and-auto-removal-of-standby-from-sync-standbys-list solution
in one or the other forms with auto removal timeout set to 5 minutes,
any of following can happen:
1) query is stuck waiting for sync standby ack in SyncRepWaitForLSN(),
no query cancel or proc die interrupt is arrived, the sync standby is
made an async standby after the timeout i.e. 5 minutes.
2) query is stuck waiting for sync standby ack in SyncRepWaitForLSN(),
say for about 3 minutes, then query cancel or proc die interrupt is
arrived, should we immediately process it or wait for timeout to
happen (2 more minutes) and then process the interrupt? If we
immediately process the interrupts, then the
locally-committed-but-not-replicated-to-sync-standby problems
described upthread [2] are left unresolved.
I have a feeling once we have the timeout, we would disable query cancel
when we are in this stage since it is canceling a committed query. The
timeout would cancel the sync but at least the administrator would know.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Indecision is a decision. Inaction is an action. Mark Batterson
On Thu, Sep 29, 2022 at 3:53 PM Bruce Momjian <bruce@momjian.us> wrote:
So, what happens when an insufficient number of synchronous replicas
reply?
It's a failover.
Sessions hang because the synchronous behavior cannot be
guaranteed. We then _allow_ query cancel so the user or administrator
can get out of the hung sessions and perhaps modify
synchronous_standby_names.
Administrators should not modify synchronous_standby_names.
Administrator must shoot this node in the head.
I have always felt this has to be done at the server level, meaning when
a synchronous_standby_names replica is not responding after a certain
timeout, the administrator must be notified by calling a shell command
defined in a GUC and all sessions will ignore the replica.
Standbys are expelled from the waitlist according to quorum rules. I'd
propose not to invent more quorum rules involving shell scripts.
The Administrator expressed what number of standbys can be offline by
setting synchronous_standby_names. They actively asked for hanging
queries in case of insufficient standbys.
We have reserved administrator connections for the case when all
connection slots are used by hanging queries.
Best regards, Andrey Borodin.
On Tue, Nov 8, 2022 at 9:06 PM Andrey Borodin <amborodin86@gmail.com> wrote:
On Thu, Sep 29, 2022 at 3:53 PM Bruce Momjian <bruce@momjian.us> wrote:
So, what happens when an insufficient number of synchronous replicas
reply?
It's a failover.
Sessions hang because the synchronous behavior cannot be
guaranteed. We then _allow_ query cancel so the user or administrator
can get out of the hung sessions and perhaps modify
synchronous_standby_names.
Administrators should not modify synchronous_standby_names.
Administrator must shoot this node in the head.
Some funny stuff. If a user tries to cancel a non-replicated transaction
Azure Postgres will answer: "user requested cancel while waiting for
synchronous replication ack. The COMMIT record has already flushed to
WAL locally and might not have been replicatead to the standby. We
must wait here."
AWS RDS will answer: "ignoring request to cancel wait for synchronous
replication"
Yandex Managed Postgres will answer: "canceling wait for synchronous
replication due requested, but cancelation is not allowed. The
transaction has already committed locally and might not have been
replicated to the standby. We must wait here."
So, for many services providing Postgres as a service it's only a
matter of wording.
Best regards, Andrey Borodin.
On Mon, Nov 28, 2022 at 12:57 AM Andrey Borodin <amborodin86@gmail.com> wrote:
Some funny stuff. If a user tries to cancel a non-replicated transaction
Azure Postgres will answer: "user requested cancel while waiting for
synchronous replication ack. The COMMIT record has already flushed to
WAL locally and might not have been replicatead to the standby. We
must wait here."
AWS RDS will answer: "ignoring request to cancel wait for synchronous
replication"
Yandex Managed Postgres will answer: "canceling wait for synchronous
replication due requested, but cancelation is not allowed. The
transaction has already committed locally and might not have been
replicated to the standby. We must wait here."So, for many services providing Postgres as a service it's only a
matter of wording.
Thanks for verifying the behaviour. And many thanks for an off-list chat.
FWIW, I'm planning to prepare a patch as per the below idea which is
something similar to the initial proposal in this thread. Meanwhile,
thoughts are welcome.
1. Disable query cancel/CTRL+C/SIGINT when a backend is waiting for
sync replication acknowledgement.
2. Process proc die immediately when a backend is waiting for sync
replication acknowledgement, as it does today, however, upon restart,
don't open up for business (don't accept read-only connections)
unless the sync standbys have caught up.
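As a rough illustration of point (1) only (this is not the actual
patch; the message wording and the exact placement inside the
SyncRepWaitForLSN() wait loop are assumptions), the query cancel
handling could be changed along these lines:

    /*
     * Sketch: do not honor query cancel while waiting for the
     * synchronous replication acknowledgement; warn and keep waiting.
     */
    if (QueryCancelPending)
    {
        QueryCancelPending = false;
        ereport(WARNING,
                (errmsg("ignoring query cancel request while waiting for synchronous replication acknowledgement"),
                 errdetail("The COMMIT record has already been flushed locally but might not have been replicated to the standby.")));
        /* deliberately no SyncRepCancelWait()/break here */
    }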
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Sun, Nov 27, 2022 at 11:26:50AM -0800, Andrey Borodin wrote:
Some funny stuff. If a user tries to cancel a non-replicated transaction
Azure Postgres will answer: "user requested cancel while waiting for
synchronous replication ack. The COMMIT record has already flushed to
WAL locally and might not have been replicatead to the standby. We
must wait here."
AWS RDS will answer: "ignoring request to cancel wait for synchronous
replication"
Yandex Managed Postgres will answer: "canceling wait for synchronous
replication due requested, but cancelation is not allowed. The
transaction has already committed locally and might not have been
replicated to the standby. We must wait here."So, for many services providing Postgres as a service it's only a
matter of wording.
Wow, you are telling me all three cloud vendors changed how query cancel
behaves on an unresponsive synchronous replica? That is certainly a
testament that the community needs to change or at least review our
behavior.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Embrace your flaws. They make you human, rather than perfect,
which you will never be.
On Mon, Nov 28, 2022 at 12:03:06PM +0530, Bharath Rupireddy wrote:
Thanks for verifying the behaviour. And many thanks for an off-list chat.
FWIW, I'm planning to prepare a patch as per the below idea which is
something similar to the initial proposal in this thread. Meanwhile,
thoughts are welcome.
1. Disable query cancel/CTRL+C/SIGINT when a backend is waiting for
sync replication acknowledgement.
2. Process proc die immediately when a backend is waiting for sync
replication acknowledgement, as it does today, however, upon restart,
don't open up for business (don't accept read-only connections)
unless the sync standbys have caught up.
You can prepare a patch, but it is unlikely to get much interest until you
get agreement on what the behavior should be. The optimal order of
developer actions is:
Desirability -> Design -> Implement -> Test -> Review -> Commit
https://wiki.postgresql.org/wiki/Todo#Development_Process
Telling us what other cloud vendors do is not sufficient.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Embrace your flaws. They make you human, rather than perfect,
which you will never be.
On Mon, Nov 28, 2022 at 12:59 PM Bruce Momjian <bruce@momjian.us> wrote:
You can prepare a patch, but it is unlikely to get much interest until you
get agreement on what the behavior should be.
We discussed the approach on 2020's Unconference [0]. And there kind
of was an agreement.
Then I made a presentation on FOSDEM with all the details [1].
The patch had been on commitfest since 2019 [2]. There were reviewers
in the CF entry, and we kind of had an agreement.
Jeff Davis proposed a similar patch [3]. And we certainly agreed about cancels.
And now Bharath is proposing the same.
We have the interest and agreement.
Best regards, Andrey Borodin.
[0]: https://wiki.postgresql.org/wiki/PgCon_2020_Developer_Unconference/Edge_cases_of_synchronous_replication_in_HA_solutions
[1]: https://archive.fosdem.org/2021/schedule/event/postgresql_caveats_of_replication/attachments/slides/4365/export/events/attachments/postgresql_caveats_of_replication/slides/4365/sides.pdf
[2]: https://commitfest.postgresql.org/26/2402/
[3]: /messages/by-id/6a052e81060824a8286148b1165bafedbd7c86cd.camel@j-davis.com
On Mon, Nov 28, 2022 at 01:31:39PM -0800, Andrey Borodin wrote:
On Mon, Nov 28, 2022 at 12:59 PM Bruce Momjian <bruce@momjian.us> wrote:
You can prepare a patch, but it is unlikely to get much interest until you
get agreement on what the behavior should be.
We discussed the approach on 2020's Unconference [0]. And there kind
of was an agreement.
Then I made a presentation on FOSDEM with all the details [1].
The patch had been on commitfest since 2019 [2]. There were reviewers
in the CF entry, and we kind of had an agreement.
Jeff Davis proposed a similar patch [3]. And we certainly agreed about cancels.
And now Bharath is proposing the same.
We have the interest and agreement.
Okay, I was not aware we had such broad agreement.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Embrace your flaws. They make you human, rather than perfect,
which you will never be.
On Sun, Nov 27, 2022 at 10:33 PM Bharath Rupireddy <
bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, Nov 28, 2022 at 12:57 AM Andrey Borodin <amborodin86@gmail.com>
wrote:
Some funny stuff. If a user tries to cancel a non-replicated transaction
Azure Postgres will answer: "user requested cancel while waiting for
synchronous replication ack. The COMMIT record has already flushed to
WAL locally and might not have been replicatead to the standby. We
must wait here."
AWS RDS will answer: "ignoring request to cancel wait for synchronous
replication"
Yandex Managed Postgres will answer: "canceling wait for synchronous
replication due requested, but cancelation is not allowed. The
transaction has already committed locally and might not have been
replicated to the standby. We must wait here."So, for many services providing Postgres as a service it's only a
matter of wording.
Thanks for verifying the behaviour. And many thanks for an off-list chat.
FWIW, I'm planning to prepare a patch as per the below idea which is
something similar to the initial proposal in this thread. Meanwhile,
thoughts are welcome.
1. Disable query cancel/CTRL+C/SIGINT when a backend is waiting for
sync replication acknowledgement.
+1
2. Process proc die immediately when a backend is waiting for sync
replication acknowledgement, as it does today, however, upon restart,
don't open up for business (don't accept read-only connections)
unless the sync standbys have caught up.
Are you planning to block connections or queries to the database? It would
be good to allow connections and let them query the monitoring views but
block the queries until sync standby have caught up. Otherwise, this leaves
a monitoring hole. In cloud, I presume superusers are allowed to connect
and monitor (end customers are not the role members and can't query the
data). The same can't be true for all the installations. Could you please
add more details on your approach?
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
2. Process proc die immediately when a backend is waiting for sync
replication acknowledgement, as it does today, however, upon restart,
don't open up for business (don't accept read-only connections)
unless the sync standbys have caught up.
Are you planning to block connections or queries to the database? It would be
good to allow connections and let them query the monitoring views but block the
queries until sync standby have caught up. Otherwise, this leaves a monitoring
hole. In cloud, I presume superusers are allowed to connect and monitor (end
customers are not the role members and can't query the data). The same can't be
true for all the installations. Could you please add more details on your
approach?
I think ALTER SYSTEM should be allowed, particularly so you can modify
synchronous_standby_names, no?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Embrace your flaws. They make you human, rather than perfect,
which you will never be.
On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
2. Process proc die immediately when a backend is waiting for sync
replication acknowledgement, as it does today, however, upon restart,
don't open up for business (don't accept read-only connections)
unless the sync standbys have caught up.
Are you planning to block connections or queries to the database? It
would be good to allow connections and let them query the monitoring
views but block the queries until sync standby have caught up.
Otherwise, this leaves a monitoring hole. In cloud, I presume
superusers are allowed to connect and monitor (end customers are not
the role members and can't query the data). The same can't be true for
all the installations. Could you please add more details on your
approach?
I think ALTER SYSTEM should be allowed, particularly so you can modify
synchronous_standby_names, no?
Yes, Change in synchronous_standby_names is expected in this situation.
IMHO, blocking all the connections is not a recommended approach.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Embrace your flaws. They make you human, rather than perfect,
which you will never be.
On Tue, Nov 29, 2022 at 8:42 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
2. Process proc die immediately when a backend is waiting for sync
replication acknowledgement, as it does today, however, upon restart,
don't open up for business (don't accept read-only connections)
unless the sync standbys have caught up.
Are you planning to block connections or queries to the database? It
would be good to allow connections and let them query the monitoring
views but block the queries until sync standby have caught up.
Otherwise, this leaves a monitoring hole. In cloud, I presume
superusers are allowed to connect and monitor (end customers are not
the role members and can't query the data). The same can't be true for
all the installations. Could you please add more details on your
approach?
I think ALTER SYSTEM should be allowed, particularly so you can modify
synchronous_standby_names, no?
Yes, Change in synchronous_standby_names is expected in this situation.
IMHO, blocking all the connections is not a recommended approach.
How about allowing superusers (they can still read locally committed data)
and users part of pg_monitor role?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Embrace your flaws. They make you human, rather than perfect,
which you will never be.
On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
2. Process proc die immediately when a backend is waiting for sync
replication acknowledgement, as it does today, however, upon restart,
don't open up for business (don't accept read-only connections)
unless the sync standbys have caught up.
Are you planning to block connections or queries to the database? It would be
good to allow connections and let them query the monitoring views but block the
queries until sync standby have caught up. Otherwise, this leaves a monitoring
hole. In cloud, I presume superusers are allowed to connect and monitor (end
customers are not the role members and can't query the data). The same can't be
true for all the installations. Could you please add more details on your
approach?
I think ALTER SYSTEM should be allowed, particularly so you can modify
synchronous_standby_names, no?
We don't allow SQL access during crash recovery until it's caught up
to consistency point. And that's for a reason - the cluster may have
invalid system catalog.
So no, after crash without a quorum of standbys you can only change
auto.conf and send SIGHUP. Accessing the system catalog during crash
recovery is another unrelated problem.
But I'd propose to treat these two points differently, they possess
drastically different scales of danger. Query Cancels are issued here
and there during failovers\switchovers. Crash amidst network
partitioning is not that common.
Best regards, Andrey Borodin.
On Tue, Nov 29, 2022 at 10:52 AM Andrey Borodin <amborodin86@gmail.com>
wrote:
On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
2. Process proc die immediately when a backend is waiting for sync
replication acknowledgement, as it does today, however, upon restart,
don't open up for business (don't accept read-only connections)
unless the sync standbys have caught up.
Are you planning to block connections or queries to the database? It
would be good to allow connections and let them query the monitoring
views but block the queries until sync standby have caught up.
Otherwise, this leaves a monitoring hole. In cloud, I presume
superusers are allowed to connect and monitor (end customers are not
the role members and can't query the data). The same can't be true for
all the installations. Could you please add more details on your
approach?
I think ALTER SYSTEM should be allowed, particularly so you can modify
synchronous_standby_names, no?
We don't allow SQL access during crash recovery until it's caught up
to consistency point. And that's for a reason - the cluster may have
invalid system catalog.
So no, after crash without a quorum of standbys you can only change
auto.conf and send SIGHUP. Accessing the system catalog during crash
recovery is another unrelated problem.
In the crash recovery case, catalog is inconsistent but in this case, the
cluster has remote uncommitted changes (consistent). Accepting a superuser
connection is no harm. The auth checks performed are still valid after
standbys fully caught up. I don't see a reason why superuser / pg_monitor
connections are required to be blocked.
But I'd propose to treat these two points differently, they possess
drastically different scales of danger. Query Cancels are issued here
and there during failovers\switchovers. Crash amidst network
partitioning is not that common.
Supportability and operability are more important in corner cases to
quickly troubleshoot an issue,
Best regards, Andrey Borodin.
On Tue, Nov 29, 2022 at 11:20 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:
On Tue, Nov 29, 2022 at 10:52 AM Andrey Borodin <amborodin86@gmail.com>
wrote:On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM
wrote:
2. Process proc die immediately when a backend is waiting for sync
replication acknowledgement, as it does today, however, upon restart,
don't open up for business (don't accept read-only connections)
unless the sync standbys have caught up.
Are you planning to block connections or queries to the database? It
would be good to allow connections and let them query the monitoring
views but block the queries until sync standby have caught up.
Otherwise, this leaves a monitoring hole. In cloud, I presume
superusers are allowed to connect and monitor (end customers are not
the role members and can't query the data). The same can't be true for
all the installations. Could you please add more details on your
approach?
I think ALTER SYSTEM should be allowed, particularly so you can modify
synchronous_standby_names, no?
We don't allow SQL access during crash recovery until it's caught up
to consistency point. And that's for a reason - the cluster may have
invalid system catalog.
So no, after crash without a quorum of standbys you can only change
auto.conf and send SIGHUP. Accessing the system catalog during crash
recovery is another unrelated problem.
In the crash recovery case, catalog is inconsistent but in this case, the
cluster has remote uncommitted changes (consistent). Accepting a superuser
connection is no harm. The auth checks performed are still valid after
standbys fully caught up. I don't see a reason why superuser / pg_monitor
connections are required to be blocked.
If blocking queries is harder, and superuser is not allowed to connect as
it can read remote uncommitted data, how about adding a new role that can
update and reload the server configuration?
But I'd propose to treat these two points differently, they possess
drastically different scales of danger. Query Cancels are issued here
and there during failovers\switchovers. Crash amidst network
partitioning is not that common.Supportability and operability are more important in corner cases to
quickly troubleshoot an issue,
Best regards, Andrey Borodin.
On Tue, Nov 29, 2022 at 10:45 PM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
On Tue, Nov 29, 2022 at 8:42 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
2. Process proc die immediately when a backend is waiting for sync
replication acknowledgement, as it does today, however, upon restart,
don't open up for business (don't accept read-only connections)
unless the sync standbys have caught up.
Are you planning to block connections or queries to the database? It would be
good to allow connections and let them query the monitoring views but block the
queries until sync standby have caught up. Otherwise, this leaves a monitoring
hole. In cloud, I presume superusers are allowed to connect and monitor (end
customers are not the role members and can't query the data). The same can't be
true for all the installations. Could you please add more details on your
approach?
I think ALTER SYSTEM should be allowed, particularly so you can modify
synchronous_standby_names, no?
Yes, Change in synchronous_standby_names is expected in this situation. IMHO, blocking all the connections is not a recommended approach.
How about allowing superusers (they can still read locally committed data) and users part of pg_monitor role?
I started to spend time on this feature again. Thanks all for your
comments so far.
Per latest comments, it looks like we're mostly okay to emit a warning
and ignore query cancel interrupts while waiting for sync replication
ACK.
For proc die, it looks like the suggestion was to process it
immediately and upon next restart, don't allow user connections unless
all sync standbys have caught up. However, we need to be able to allow
replication connections from standbys so that they'll be able to
stream the needed WAL and catch up with the primary, and to allow
superuser or users with the pg_monitor role to connect to perform
ALTER SYSTEM (to remove the unresponsive sync standbys, if any, from
the list or to disable sync replication altogether) or to monitor the
flush LSN/catch-up status. All other connections would be blocked.
Note that replication, superuser and
users with pg_monitor role connections are allowed only after the
server reaches a consistent state not before that to not read any
inconsistent data.
The trickiest part of doing the above is how we detect upon restart
that the server received proc die while waiting for sync replication
ACK. One idea might be to set a flag in the control file before the
crash. Second idea might be to write a marker file (although I don't
favor this idea); presence indicates that the server was waiting for
sync replication ACK before the crash. However, we may not be able to
detect all sorts of crashes of a backend while it is waiting for sync
replication ACK with either of these two ideas. Therefore, this may not
be a complete solution.
Third idea might be to just let the primary wait for sync standbys to
catch up upon restart irrespective of whether it was crashed or not
while waiting for sync replication ACK. While this idea works well
without having to detect all sorts of crashes, the primary may not
come up if any unresponsive standbys are present (currently, the
primary continues to be operational for read-only queries at least
irrespective of whether sync standbys have caught up or not).
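For what it's worth, the catch-up check itself might look roughly like
the hypothetical helper below (not from any posted patch; it assumes
the PG15-style GetFlushRecPtr(NULL) signature and deliberately ignores
synchronous_standby_names priority/quorum rules):

    #include "access/xlog.h"
    #include "replication/walsender.h"
    #include "replication/walsender_private.h"
    #include "storage/spin.h"

    /*
     * Hypothetical helper: true once every currently-connected walsender
     * has confirmed flushing WAL up to the primary's current flush
     * location.  A real implementation would have to honor
     * synchronous_standby_names instead of looking at every walsender.
     */
    static bool
    SyncStandbysCaughtUp(void)
    {
        XLogRecPtr  local_flush = GetFlushRecPtr(NULL);

        for (int i = 0; i < max_wal_senders; i++)
        {
            WalSnd     *walsnd = &WalSndCtl->walsnds[i];
            pid_t       pid;
            XLogRecPtr  standby_flush;

            SpinLockAcquire(&walsnd->mutex);
            pid = walsnd->pid;
            standby_flush = walsnd->flush;
            SpinLockRelease(&walsnd->mutex);

            if (pid != 0 && standby_flush < local_flush)
                return false;
        }

        return true;
    }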
Thoughts?
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 30 Jan 2023, at 06:55, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
I started to spend time on this feature again. Thanks all for your
comments so far.
Since there hasn't been any updates for the past six months, and the patch
hasn't applied for a few months, I am marking this returned with feedback for
now. Please feel free to open a new entry in a future CF for this patch when
there is a new version.
--
Daniel Gustafsson