An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

Started by Bharath Rupireddy · 42 messages · pgsql-hackers
#1 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com

Hi,

With synchronous replication typically all the transactions (txns)
first locally get committed, then streamed to the sync standbys and
the backend that generated the transaction will wait for ack from sync
standbys. While waiting for ack, it may happen that the query or the
txn gets canceled (QueryCancelPending is true) or the waiting backend
is asked to exit (ProcDiePending is true). In either of these cases,
the wait for ack gets canceled and leaves the txn in an inconsistent
state (as in the client thinks that the txn would have replicated to
sync standbys) - "The transaction has already committed locally, but
might not have been replicated to the standby.". Upon restart after
the crash or in the next txn after the old locally committed txn was
canceled, the client will be able to see the txns that weren't
actually streamed to sync standbys. Also, if the client fails over to
one of the sync standbys after the crash (either by choice or because
of automatic failover management after crash), the locally committed
txns on the crashed primary would be lost which isn't good in a true
HA solution.

Here's a proposal (mentioned previously by Satya [1]) to avoid the
above problems:
1) Wait a configurable amount of time before canceling the sync
replication wait in the backends, i.e. delay processing of
QueryCancelPending and ProcDiePending. Introduce a new timeout GUC
synchronous_replication_naptime_before_cancel; when set, it will let
the backends wait for the ack before canceling the synchronous
replication wait so that the transaction can be available on the sync
standbys as well. If the ack isn't received even within this time
frame, the backend cancels the wait and goes ahead as it does today.
In production HA environments, the GUC can be set to a reasonable
value to avoid missing transactions during failovers.
2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend, that's making the first
connection, wait for sync standbys to catch up in ClientAuthentication
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.
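To make (1) concrete, here is a minimal, self-contained sketch of the bounded-wait idea. Everything in it (sync_rep_wait, the poll callbacks, WaitResult) is invented for illustration; the real implementation would live in SyncRepWaitForLSN() and wait on latches and LSNs rather than polling callbacks.

```c
#include <stdbool.h>

/* Illustrative only: possible outcomes of the bounded replication wait. */
typedef enum { WAIT_ACKED, WAIT_CANCELED, WAIT_TIMED_OUT } WaitResult;

/*
 * Poll for the standby ack in tick_ms steps.  A pending query cancel is
 * honored only after naptime_ms has elapsed, mirroring the proposed
 * synchronous_replication_naptime_before_cancel GUC.  max_ms bounds the
 * simulation so the sketch always terminates.
 */
static WaitResult
sync_rep_wait(bool (*ack_received)(int tick),
              bool (*cancel_pending)(int tick),
              int naptime_ms, int tick_ms, int max_ms)
{
    for (int elapsed = 0; elapsed <= max_ms; elapsed += tick_ms)
    {
        int tick = elapsed / tick_ms;

        if (ack_received(tick))
            return WAIT_ACKED;      /* txn replicated to sync standbys */

        /* Delay processing of QueryCancelPending until naptime expires. */
        if (cancel_pending(tick) && elapsed >= naptime_ms)
            return WAIT_CANCELED;   /* txn remains local-only */
    }
    return WAIT_TIMED_OUT;
}
```

With a naptime of, say, 5000 ms, a cancel arriving early no longer abandons the wait immediately; the backend still gives the ack up to five seconds to arrive.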

Thoughts?

Here's a WIP patch implementing (1); I'm yet to code (2). I
haven't added tests; I'm yet to figure out how to add one, as there's
no way we can delay the WAL sender so that we can reliably hit this
code. I will think more about this.

[1]: /messages/by-id/CAHg+QDdTdPsqtu0QLG8rMg3Xo=6Xo23TwHPYsUgGNEK13wTY5g@mail.gmail.com

Regards,
Bharath Rupireddy.

Attachments:

v1-0001-Wait-specified-amount-of-time-before-cancelling-s.patch (application/x-patch, +97 −1)
#2 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#1)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Mon, Apr 25, 2022 at 07:51:03PM +0530, Bharath Rupireddy wrote:

With synchronous replication typically all the transactions (txns)
first locally get committed, then streamed to the sync standbys and
the backend that generated the transaction will wait for ack from sync
standbys. While waiting for ack, it may happen that the query or the
txn gets canceled (QueryCancelPending is true) or the waiting backend
is asked to exit (ProcDiePending is true). In either of these cases,
the wait for ack gets canceled and leaves the txn in an inconsistent
state (as in the client thinks that the txn would have replicated to
sync standbys) - "The transaction has already committed locally, but
might not have been replicated to the standby.". Upon restart after
the crash or in the next txn after the old locally committed txn was
canceled, the client will be able to see the txns that weren't
actually streamed to sync standbys. Also, if the client fails over to
one of the sync standbys after the crash (either by choice or because
of automatic failover management after crash), the locally committed
txns on the crashed primary would be lost which isn't good in a true
HA solution.

This topic has come up a few times recently [0] [1] [2].

Thoughts?

I think this will require a fair amount of discussion. I'm personally in
favor of just adding a GUC that can be enabled to block canceling
synchronous replication waits, but I know folks have concerns with that
approach. When I looked at this stuff previously [2], it seemed possible
to handle the other data loss scenarios (e.g., forcing failover whenever
the primary shut down, turning off restart_after_crash). However, I'm not
wedded to this approach.

[0]: /messages/by-id/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C@yandex-team.ru
[1]: /messages/by-id/cac4b9df-92c6-77aa-687b-18b86cb13728@stratox.cz
[2]: /messages/by-id/FDE157D7-3F35-450D-B927-7EC2F82DB1D6@amazon.com

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#3 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Bharath Rupireddy (#1)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:

With synchronous replication typically all the transactions (txns)
first locally get committed, then streamed to the sync standbys and
the backend that generated the transaction will wait for ack from sync
standbys. While waiting for ack, it may happen that the query or the
txn gets canceled (QueryCancelPending is true) or the waiting backend
is asked to exit (ProcDiePending is true). In either of these cases,
the wait for ack gets canceled and leaves the txn in an inconsistent
state [...]

Here's a proposal (mentioned previously by Satya [1]) to avoid the
above problems:
1) Wait a configurable amount of time before canceling the sync
replication wait in the backends, i.e. delay processing of
QueryCancelPending and ProcDiePending. Introduce a new timeout GUC
synchronous_replication_naptime_before_cancel; when set, it will let
the backends wait for the ack before canceling the synchronous
replication wait so that the transaction can be available on the sync
standbys as well.
2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled.

While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.

Is it worth adding additional complexity that is not a complete solution?

Yours,
Laurenz Albe

#4 Andrey Borodin
amborodin@acm.org
In reply to: Nathan Bossart (#2)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On 25 Apr 2022, at 21:48, Nathan Bossart <nathandbossart@gmail.com> wrote:

I'm personally in
favor of just adding a GUC that can be enabled to block canceling
synchronous replication waits

+1. I think it's the only option to provide quorum commit guarantees.

Best regards, Andrey Borodin.

#5 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Laurenz Albe (#3)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Tue, Apr 26, 2022 at 11:57 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:

With synchronous replication typically all the transactions (txns)
first locally get committed, then streamed to the sync standbys and
the backend that generated the transaction will wait for ack from sync
standbys. While waiting for ack, it may happen that the query or the
txn gets canceled (QueryCancelPending is true) or the waiting backend
is asked to exit (ProcDiePending is true). In either of these cases,
the wait for ack gets canceled and leaves the txn in an inconsistent
state [...]

Here's a proposal (mentioned previously by Satya [1]) to avoid the
above problems:
1) Wait a configurable amount of time before canceling the sync
replication wait in the backends, i.e. delay processing of
QueryCancelPending and ProcDiePending. Introduce a new timeout GUC
synchronous_replication_naptime_before_cancel; when set, it will let
the backends wait for the ack before canceling the synchronous
replication wait so that the transaction can be available on the sync
standbys as well.
2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled.

While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.

Not sure if it's recommended to use 2PC in a postgres HA setup with sync
replication, where the documentation [1] says that "PREPARE TRANSACTION"
and other 2PC commands are "intended for use by external transaction
management systems" and are used with explicit transactions. Whereas the
txns within a postgres HA setup with sync replication don't always have
to be explicit txns. Am I missing something here?

Is it worth adding additional complexity that is not a complete solution?

The proposed approach helps to avoid some common possible problems
that arise with simple scenarios (like cancelling a long running query
while in SyncRepWaitForLSN) within sync replication.

[1]: https://www.postgresql.org/docs/devel/sql-prepare-transaction.html

Regards,
Bharath Rupireddy.

#6 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Bharath Rupireddy (#5)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Mon, May 9, 2022 at 2:50 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Tue, Apr 26, 2022 at 11:57 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:

With synchronous replication typically all the transactions (txns)
first locally get committed, then streamed to the sync standbys and
the backend that generated the transaction will wait for ack from sync
standbys. While waiting for ack, it may happen that the query or the
txn gets canceled (QueryCancelPending is true) or the waiting backend
is asked to exit (ProcDiePending is true). In either of these cases,
the wait for ack gets canceled and leaves the txn in an inconsistent
state [...]

Here's a proposal (mentioned previously by Satya [1]) to avoid the
above problems:
1) Wait a configurable amount of time before canceling the sync
replication wait in the backends, i.e. delay processing of
QueryCancelPending and ProcDiePending. Introduce a new timeout GUC
synchronous_replication_naptime_before_cancel; when set, it will let
the backends wait for the ack before canceling the synchronous
replication wait so that the transaction can be available on the sync
standbys as well.
2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled.

While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.

Not sure if it's recommended to use 2PC in a postgres HA setup with sync
replication, where the documentation [1] says that "PREPARE TRANSACTION"
and other 2PC commands are "intended for use by external transaction
management systems" and are used with explicit transactions. Whereas the
txns within a postgres HA setup with sync replication don't always have
to be explicit txns. Am I missing something here?

Is it worth adding additional complexity that is not a complete solution?

The proposed approach helps to avoid some common possible problems
that arise with simple scenarios (like cancelling a long running query
while in SyncRepWaitForLSN) within sync replication.

IMHO, making it wait for some amount of time, based on GUC is not a
complete solution. It is just a hack to avoid the problem in some
cases.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#7 Andrey Borodin
amborodin@acm.org
In reply to: Bharath Rupireddy (#5)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On 9 May 2022, at 14:20, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

On Tue, Apr 26, 2022 at 11:57 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.

Not sure if it's recommended to use 2PC in a postgres HA setup with sync
replication, where the documentation [1] says that "PREPARE TRANSACTION"
and other 2PC commands are "intended for use by external transaction
management systems" and are used with explicit transactions. Whereas the
txns within a postgres HA setup with sync replication don't always have
to be explicit txns. Am I missing something here?

COMMIT PREPARED needs to be replicated as well, thus encountering the very same problem as a usual COMMIT: if done during failover, it can be canceled, and committed data can be wrongfully reported as durably written. 2PC is not a remedy for the fact that PG silently cancels the wait for sync replication. The problem arises in the presence of any "commit". And a "commit" is there if transactions are there.

On 9 May 2022, at 14:44, Dilip Kumar <dilipbalaut@gmail.com> wrote:

IMHO, making it wait for some amount of time, based on GUC is not a
complete solution. It is just a hack to avoid the problem in some
cases.

Disallowing cancelation of locally committed transactions is not a hack. It's the removal of a hack that was erroneously installed to keep the backend responsive to Ctrl+C (or a client-side statement timeout).

On 26 Apr 2022, at 11:26, Laurenz Albe <laurenz.albe@cybertec.at> wrote:

Is it worth adding additional complexity that is not a complete solution?

It's not additional complexity. It is removing the additional complexity that made sync rep interruptible. (But I'm surely not talking about GUCs like synchronous_replication_naptime_before_cancel - the wait for sync rep must be indefinite until synchronous_commit/synchronous_standby_names are satisfied.)

And yes, we need additional complexity - but in some other place. A transaction can also be locally committed in the presence of a server crash. But this is another difficult problem. A crashed server must not allow data queries until the LSN of the timeline end is successfully replicated to synchronous_standby_names.

Best regards, Andrey Borodin.

#8 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andrey Borodin (#7)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Mon, May 9, 2022 at 4:39 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 9 May 2022, at 14:44, Dilip Kumar <dilipbalaut@gmail.com> wrote:

IMHO, making it wait for some amount of time, based on GUC is not a
complete solution. It is just a hack to avoid the problem in some
cases.

Disallowing cancelation of locally committed transactions is not a hack. It's the removal of a hack that was erroneously installed to keep the backend responsive to Ctrl+C (or a client-side statement timeout).

I might be missing something, but based on my understanding the
approach is not disallowing query cancellation; it is just adding
configuration for how long to delay before canceling the query.
That's the reason I mentioned that this is not a guaranteed
solution. I mean, even with this configuration value you cannot
avoid the problem in all cases, right?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#9 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Dilip Kumar (#8)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Tue, May 10, 2022 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 9, 2022 at 4:39 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 9 May 2022, at 14:44, Dilip Kumar <dilipbalaut@gmail.com> wrote:

IMHO, making it wait for some amount of time, based on GUC is not a
complete solution. It is just a hack to avoid the problem in some
cases.

Disallowing cancelation of locally committed transactions is not a hack. It's the removal of a hack that was erroneously installed to keep the backend responsive to Ctrl+C (or a client-side statement timeout).

I might be missing something, but based on my understanding the
approach is not disallowing query cancellation; it is just adding
configuration for how long to delay before canceling the query.
That's the reason I mentioned that this is not a guaranteed
solution. I mean, even with this configuration value you cannot
avoid the problem in all cases, right?

Yes Dilip, the proposed GUC in the v1 patch doesn't allow waiting forever
for the sync repl ack; in other words, it doesn't allow blocking the
pending query cancels or proc die interrupts forever. Otherwise, the
backends may linger when the ack isn't received or the sync replicas
aren't reachable. Users may have to set the GUC to a 'reasonable value'.

If okay, I can make the GUC behave this way - value 0 existing
behaviour i.e. no wait for sync repl ack, just process query cancels
and proc die interrupts immediately; value -1 wait unboundedly for the
ack; value > 0 wait for specified milliseconds for the ack.
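The three-valued behaviour described above can be captured in a small predicate. This is only a sketch of the proposed semantics, assuming the GUC value is in milliseconds as in the v1 patch; the function name is invented.

```c
#include <stdbool.h>

/*
 * Sketch of the proposed GUC semantics:
 *    0  -> process query cancels / proc die interrupts immediately
 *   -1  -> wait unboundedly for the ack (never allow the cancel)
 *   >0  -> allow the cancel only after that many milliseconds
 */
static bool
cancel_allowed(int naptime_before_cancel_ms, int waited_ms)
{
    if (naptime_before_cancel_ms == 0)
        return true;                /* existing behaviour */
    if (naptime_before_cancel_ms < 0)
        return false;               /* wait forever for the ack */
    return waited_ms >= naptime_before_cancel_ms;
}
```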

Regards,
Bharath Rupireddy.

#10 Andrey Borodin
amborodin@acm.org
In reply to: Bharath Rupireddy (#9)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On 10 May 2022, at 12:59, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

If okay, I can make the GUC behave this way - value 0 existing
behaviour i.e. no wait for sync repl ack, just process query cancels
and proc die interrupts immediately; value -1 wait unboundedly for the
ack; value > 0 wait for specified milliseconds for the ack.

+1 if we make -1 and 0 the only valid values.

query cancels or proc die interrupts

Please note, that typical HA tool would need to handle query cancels and proc die interrupts differently.

When the network is partitioned and a standby is promoted somewhere, you definitely want an infinite wait for cancels. Yet at some point you will want to shut down postgres without a core dump - thus proc die needs to be processed.
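The asymmetry between the two interrupt types can be sketched as a tiny decision function; the names are hypothetical, and the real decision would sit inside the SyncRepWaitForLSN() wait loop.

```c
#include <stdbool.h>

typedef enum { KEEP_WAITING, STOP_FOR_SHUTDOWN, STOP_FOR_CANCEL } WaitAction;

/*
 * Proc die (shutdown) is always honored so the server can stop cleanly;
 * a query cancel is honored only when policy allows it (e.g. after the
 * configured naptime, or never, if waits must be uninterruptible).
 */
static WaitAction
next_action(bool query_cancel_pending, bool proc_die_pending,
            bool allow_cancel)
{
    if (proc_die_pending)
        return STOP_FOR_SHUTDOWN;
    if (query_cancel_pending && allow_cancel)
        return STOP_FOR_CANCEL;
    return KEEP_WAITING;
}
```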

Thanks!

Best regards, Andrey Borodin.

#11 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andrey Borodin (#10)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Tue, May 10, 2022 at 5:55 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 10 May 2022, at 12:59, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

If okay, I can make the GUC behave this way - value 0 existing
behaviour i.e. no wait for sync repl ack, just process query cancels
and proc die interrupts immediately; value -1 wait unboundedly for the
ack; value > 0 wait for specified milliseconds for the ack.

+1 if we make -1 and 0 the only valid values.

query cancels or proc die interrupts

Please note, that typical HA tool would need to handle query cancels and proc die interrupts differently.

Agree.

When the network is partitioned and a standby is promoted somewhere, you definitely want an infinite wait for cancels.

When a standby is promoted, there is no point in the old primary waiting
forever for the ack, assuming we are going to discard it.

Yet at some point you will want to shut down postgres without a core dump - thus proc die needs to be processed.

I think users can still have the flexibility to set the right amount
of time to process cancel and proc die interrupts.

IMHO, the GUC can still have 0, -1 and value > 0 in milliseconds, let
the users decide on what they want. Do you see any problems with this?

Thoughts?

Regards,
Bharath Rupireddy.

#12 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andrey Borodin (#10)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Tue, May 10, 2022 at 5:55 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 10 May 2022, at 12:59, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

If okay, I can make the GUC behave this way - value 0 existing
behaviour i.e. no wait for sync repl ack, just process query cancels
and proc die interrupts immediately; value -1 wait unboundedly for the
ack; value > 0 wait for specified milliseconds for the ack.

+1 if we make -1 and 0 the only valid values.

query cancels or proc die interrupts

Please note, that typical HA tool would need to handle query cancels and proc die interrupts differently.

Hm, after thinking for a while, I tend to agree with the above
approach - meaning, query cancel interrupt processing can be
completely disabled in SyncRepWaitForLSN() while proc die interrupts
are processed immediately; this approach requires no GUC, as opposed
to the proposed v1 patch upthread.

However, it's good to see what other hackers think about this.

When the network is partitioned and a standby is promoted somewhere, you definitely want an infinite wait for cancels. Yet at some point you will want to shut down postgres without a core dump - thus proc die needs to be processed.

Agree.

On Mon, May 9, 2022 at 4:39 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 26 Apr 2022, at 11:26, Laurenz Albe <laurenz.albe@cybertec.at> wrote:

Is it worth adding additional complexity that is not a complete solution?

It's not additional complexity. It is removing the additional complexity that made sync rep interruptible. (But I'm surely not talking about GUCs like synchronous_replication_naptime_before_cancel - the wait for sync rep must be indefinite until synchronous_commit/synchronous_standby_names are satisfied.)

And yes, we need additional complexity - but in some other place. A transaction can also be locally committed in the presence of a server crash. But this is another difficult problem. A crashed server must not allow data queries until the LSN of the timeline end is successfully replicated to synchronous_standby_names.

Hm, that needs to be done anyways. How about doing as proposed
initially upthread [1]/messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com? Also, quoting the idea here [2]2) Wait for sync standbys to catch up upon restart after the crash or in the next txn after the old locally committed txn was canceled. One way to achieve this is to let the backend, that's making the first connection, wait for sync standbys to catch up in ClientAuthentication right after successful authentication. However, I'm not sure this is the best way to do it at this point..

Thoughts?

[1]: /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2]: 2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend, that's making the first
connection, wait for sync standbys to catch up in ClientAuthentication
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.

Regards,
Bharath Rupireddy.

#13 Andrey Borodin
amborodin@acm.org
In reply to: Bharath Rupireddy (#12)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On 25 Jul 2022, at 14:29, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

Hm, after thinking for a while, I tend to agree with the above
approach - meaning, query cancel interrupt processing can be
completely disabled in SyncRepWaitForLSN() while proc die interrupts
are processed immediately; this approach requires no GUC, as opposed
to the proposed v1 patch upthread.

GUC was proposed here [0] to maintain compatibility with the previous behaviour. But I think that having no GUC here is fine too, if we do not allow cancelation of unreplicated backends, of course.

And yes, we need additional complexity - but in some other place. A transaction can also be locally committed in the presence of a server crash. But this is another difficult problem. A crashed server must not allow data queries until the LSN of the timeline end is successfully replicated to synchronous_standby_names.

Hm, that needs to be done anyways. How about doing as proposed
initially upthread [1]? Also, quoting the idea here [2].

Thoughts?

[1] /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2] 2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend, that's making the first
connection, wait for sync standbys to catch up in ClientAuthentication
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.

I think ideally the startup process should not allow read-only connections in CheckRecoveryConsistency() until the WAL is replicated to a quorum, at least up to the end-of-timeline LSN.
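That gating condition could look roughly like the following; the function and its inputs (end-of-timeline LSN, quorum-confirmed flush LSN) are hypothetical, not the actual CheckRecoveryConsistency() interface.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* WAL position, as in PostgreSQL */

/*
 * Allow read-only connections only once recovery has reached a
 * consistent state AND the WAL up to the end of the old timeline is
 * confirmed replicated to a quorum of sync standbys.
 */
static bool
read_only_connections_allowed(bool reached_consistency,
                              XLogRecPtr end_of_timeline,
                              XLogRecPtr quorum_flush_lsn)
{
    return reached_consistency && quorum_flush_lsn >= end_of_timeline;
}
```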

Thanks!

[0]: https://commitfest.postgresql.org/34/2402/

#14 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andrey Borodin (#13)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Mon, Jul 25, 2022 at 4:20 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 25 Jul 2022, at 14:29, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

Hm, after thinking for a while, I tend to agree with the above
approach - meaning, query cancel interrupt processing can be
completely disabled in SyncRepWaitForLSN() while proc die interrupts
are processed immediately; this approach requires no GUC, as opposed
to the proposed v1 patch upthread.

GUC was proposed here [0] to maintain compatibility with the previous behaviour. But I think that having no GUC here is fine too, if we do not allow cancelation of unreplicated backends, of course.

And yes, we need additional complexity - but in some other place. A transaction can also be locally committed in the presence of a server crash. But this is another difficult problem. A crashed server must not allow data queries until the LSN of the timeline end is successfully replicated to synchronous_standby_names.

Hm, that needs to be done anyways. How about doing as proposed
initially upthread [1]? Also, quoting the idea here [2].

Thoughts?

[1] /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2] 2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend, that's making the first
connection, wait for sync standbys to catch up in ClientAuthentication
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.

I think ideally the startup process should not allow read-only connections in CheckRecoveryConsistency() until the WAL is replicated to a quorum, at least up to the end-of-timeline LSN.

We can't do it in CheckRecoveryConsistency() unless I'm missing
something. Because the walsenders (required for sending the remaining
WAL to sync standbys to achieve quorum) can only be started after the
server reaches a consistent state; after all, walsenders are
specialized backends.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

#15 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#14)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Thu, Aug 4, 2022 at 1:42 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Mon, Jul 25, 2022 at 4:20 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 25 Jul 2022, at 14:29, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

Hm, after thinking for a while, I tend to agree with the above
approach - meaning, query cancel interrupt processing can be
completely disabled in SyncRepWaitForLSN() while proc die interrupts
are processed immediately; this approach requires no GUC, as opposed
to the proposed v1 patch upthread.

GUC was proposed here [0] to maintain compatibility with the previous behaviour. But I think that having no GUC here is fine too, if we do not allow cancelation of unreplicated backends, of course.

And yes, we need additional complexity - but in some other place. A transaction can also be locally committed in the presence of a server crash. But this is another difficult problem. A crashed server must not allow data queries until the LSN of the timeline end is successfully replicated to synchronous_standby_names.

Hm, that needs to be done anyways. How about doing as proposed
initially upthread [1]? Also, quoting the idea here [2].

Thoughts?

[1] /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2] 2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend, that's making the first
connection, wait for sync standbys to catch up in ClientAuthentication
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.

I think ideally the startup process should not allow read-only connections in CheckRecoveryConsistency() until the WAL is replicated to a quorum, at least up to the end-of-timeline LSN.

We can't do it in CheckRecoveryConsistency() unless I'm missing
something. Because the walsenders (required for sending the remaining
WAL to sync standbys to achieve quorum) can only be started after the
server reaches a consistent state; after all, walsenders are
specialized backends.

Continuing on the above thought (I inadvertently clicked the send
button previously): A simple approach would be to check for quorum in
PostgresMain() before entering the query loop for (;;) in
non-walsender cases. A disadvantage of this would be that all the
backends will be waiting here in the worst case if achieving the sync
quorum after restart takes time - roughly we can do the following in
PostgresMain(); of course we need a locking mechanism so that all the
backends that reach here wait for the same LSN:

if (sync_replication_defined == true &&
    shmem->wait_for_sync_repl_upon_restart == true)
{
    SyncRepWaitForLSN(pg_current_wal_flush_lsn(), false);
    shmem->wait_for_sync_repl_upon_restart = false;
}

Thoughts?

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

#16 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Laurenz Albe (#3)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe <laurenz.albe@cybertec.at> wrote in

While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.

Is it worth adding additional complexity that is not a complete solution?

I would agree with this. As with 2PC, whatever we do to make it
perfect, we're left with unresolvable problems, at least for now.

Doesn't it meet your requirements if we have a means to know whether
the last transaction on the current session was locally committed or
aborted?

We are already managing the last committed LSN internally. I think we
can do the same for transaction aborts and the last inserted LSN, and
expose them in some way. This is far simpler than the (maybe)
uncompletable attempt to fill up the deep gap.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#17Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Kyotaro Horiguchi (#16)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Fri, Aug 5, 2022 at 8:19 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe <laurenz.albe@cybertec.at> wrote in

While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.

Is it worth adding additional complexity that is not a complete solution?

I would agree with this. As with 2PC, whatever we do to make it
perfect, we're left with unresolvable problems, at least for now.

Doesn't it meet your requirements if we have a means to know whether
the last transaction on the current session was locally committed or
aborted?

We are already managing the last committed LSN internally. I think we
can do the same for transaction aborts and the last inserted LSN, and
expose them in some way. This is far simpler than the (maybe)
uncompletable attempt to fill up the deep gap.

There can be more txns that are
locally-committed-but-not-yet-replicated. Even if we have that
information stored somewhere, what do we do with it? Those txns are
committed from the client perspective but not committed from the
server's perspective.

Can you please explain more about your idea, I may be missing something?

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

#18Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Bharath Rupireddy (#17)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

At Mon, 8 Aug 2022 19:13:25 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

On Fri, Aug 5, 2022 at 8:19 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe <laurenz.albe@cybertec.at> wrote in

While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby. I think that only using the full
power of two-phase commit can make this bulletproof.

Is it worth adding additional complexity that is not a complete solution?

I would agree with this. As with 2PC, whatever we do to make it
perfect, we're left with unresolvable problems, at least for now.

Doesn't it meet your requirements if we have a means to know whether
the last transaction on the current session was locally committed or
aborted?

We are already managing the last committed LSN internally. I think we
can do the same for transaction aborts and the last inserted LSN, and
expose them in some way. This is far simpler than the (maybe)
uncompletable attempt to fill up the deep gap.

There can be more txns that are
locally-committed-but-not-yet-replicated. Even if we have that
information stored somewhere, what do we do with it? Those txns are
committed from the client perspective but not committed from the
server's perspective.

Can you please explain more about your idea, I may be missing something?

(I'm not sure I understand the requirements here..)

I understand that it is about query cancellation. In the case of a
primary crash/termination, the client cannot even know whether the
commit of the ongoing transaction, if any, has been recorded. There is
no way other than to somehow confirm, after restart, that the change
made by the transaction was actually applied. I believe that is the
standard practice for applications that work on HA clusters.

The same is true in the case of query cancellation, since the commit
response doesn't reach the client either. But even in this case, if we
had functions/views that told us the
last-committed/last-aborted/last-inserted LSN on a session, we could
know whether the last transaction was committed, along with the commit
LSN, perhaps more easily.

# In fact, I see those functions rather as a means to know whether a
# change by the last transaction on a session is available on some
# replica.

For example, the heavily simplified pseudo code below might show how
the functions (if they were functions) would work.

try {
    s.execute("INSERT ..");
    c.commit();
} catch (Exception x) {
    c.commit();
    if (s.execute("SELECT pg_session_last_committed_lsn() = "
                  "pg_session_last_inserted_lsn()"))
    {
        /* the transaction has been locally committed */
        if (s.execute("SELECT replay_lsn >= pg_session_last_committed_lsn() "
                      "FROM pg_stat_replication WHERE xxxx"))
            /* the commit has been replicated to xxx, LSN is known */
    } else {
        /* the transaction has not been locally committed */
        <retry?>
    }
}

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#19Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Kyotaro Horiguchi (#18)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Tue, Aug 9, 2022 at 12:42 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Can you please explain more about your idea, I may be missing something?

(I'm not sure I understand the requirements here..)

I've explained the problem with the current HA setup with synchronous
replication upthread at [1]/messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com. Let me reiterate it here once again.

When a query is cancelled (a simple stroke of CTRL+C or a
pg_cancel_backend() call) while the txn is waiting for the ack in
SyncRepWaitForLSN(), the txn is, for the client, committed
(locally-committed-but-not-yet-replicated to all of the sync standbys)
like any other txn; a warning is emitted into the server logs, but it
is of no use to the client (think of the client as an application).
There can be many such txns waiting for the ack in SyncRepWaitForLSN(),
and a query cancel can be issued on all of those sessions. The problem
is that subsequent reads will then be able to see the data of all of
those locally-committed-but-not-yet-replicated txns - this is what I
call an inconsistency (can we call it a read-after-write
inconsistency?) caused by the lack of proper query cancel handling. And
if the sync standbys are down or unable to come up for some reason, the
primary will, until then, be serving clients the inconsistent data.
BTW, I found a report of this problem here [2].

The solution proposed for the above problem is to just 'not honor
query cancels at all while waiting for ack in SyncRepWaitForLSN()'.
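
The timing side of that proposal (the
synchronous_replication_naptime_before_cancel GUC from the first
message) can be modeled by a tiny self-contained helper. The name
should_cancel_wait and the millisecond arguments are illustrative; the
real decision is made inside SyncRepWaitForLSN()'s wait loop:

```c
#include <stdbool.h>

/*
 * Decide whether a pending query cancel / proc die request should end
 * the wait for the sync standby ack.  naptime_ms == 0 models today's
 * behavior: the cancel is honored immediately.  Otherwise the cancel
 * is honored only once the backend has already waited at least
 * naptime_ms for the ack.
 */
static bool
should_cancel_wait(bool cancel_pending, long waited_ms, long naptime_ms)
{
    if (!cancel_pending)
        return false;           /* nothing to do; keep waiting */
    if (naptime_ms == 0)
        return true;            /* current behavior: cancel immediately */
    return waited_ms >= naptime_ms;
}
```

With a naptime of 0 the behavior is unchanged; a non-zero naptime gives
the standbys a window to acknowledge before the cancel takes effect.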

When a proc die is pending, there can also be
locally-committed-but-not-yet-replicated txns. Typically, there are two
choices for the clients: 1) reuse the primary instance after restart,
or 2) fail over to one of the sync standbys. For case (1), there might
be read-after-write inconsistency as explained above. For case (2),
those txns might get lost completely if the failover target sync
standby, i.e. the new primary, didn't receive them, while the other
sync standbys that have received them are now ahead and need special
treatment (running pg_rewind) to be able to connect to the new
primary.

The solution proposed for case (1) of the above problem is to 'process
the ProcDiePending immediately and upon restart the first backend can
wait until the sync standbys are caught up to ensure no inconsistent
reads'.
The solution proposed for case (2) of the above problem is to 'either
run pg_rewind for the sync standbys that are ahead or use the idea
proposed at [3]'.

I hope the above explanation helps.

[1]: /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2]: https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
[3]: /messages/by-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw@mail.gmail.com

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

#20Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#19)
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

On Tue, Aug 9, 2022 at 2:16 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I've explained the problem with the current HA setup with synchronous
replication upthread at [1]. Let me reiterate it here once again.

When a query is cancelled (a simple stroke of CTRL+C or a
pg_cancel_backend() call) while the txn is waiting for the ack in
SyncRepWaitForLSN(), the txn is, for the client, committed
(locally-committed-but-not-yet-replicated to all of the sync standbys)
like any other txn; a warning is emitted into the server logs, but it
is of no use to the client (think of the client as an application).
There can be many such txns waiting for the ack in SyncRepWaitForLSN(),
and a query cancel can be issued on all of those sessions. The problem
is that subsequent reads will then be able to see the data of all of
those locally-committed-but-not-yet-replicated txns - this is what I
call an inconsistency (can we call it a read-after-write
inconsistency?) caused by the lack of proper query cancel handling. And
if the sync standbys are down or unable to come up for some reason, the
primary will, until then, be serving clients the inconsistent data.
BTW, I found a report of this problem here [2].

The solution proposed for the above problem is to just 'not honor
query cancels at all while waiting for ack in SyncRepWaitForLSN()'.

When a proc die is pending, there can also be
locally-committed-but-not-yet-replicated txns. Typically, there are two
choices for the clients: 1) reuse the primary instance after restart,
or 2) fail over to one of the sync standbys. For case (1), there might
be read-after-write inconsistency as explained above. For case (2),
those txns might get lost completely if the failover target sync
standby, i.e. the new primary, didn't receive them, while the other
sync standbys that have received them are now ahead and need special
treatment (running pg_rewind) to be able to connect to the new
primary.

The solution proposed for case (1) of the above problem is to 'process
the ProcDiePending immediately and upon restart the first backend can
wait until the sync standbys are caught up to ensure no inconsistent
reads'.
The solution proposed for case (2) of the above problem is to 'either
run pg_rewind for the sync standbys that are ahead or use the idea
proposed at [3]'.

I hope the above explanation helps.

[1] /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[2] https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
[3] /messages/by-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw@mail.gmail.com

I'm attaching the v2 patch rebased on the latest HEAD. Please note
that there are still some open points, I'm yet to find time to think
more about them. Meanwhile, I'm posting the v2 patch for making cfbot
happy. Any further thoughts on the overall design of the patch are
most welcome. Thanks.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-Wait-specified-amount-of-time-before-cancelling-s.patch (application/x-patch, +97 -1)
#21Bruce Momjian
bruce@momjian.us
In reply to: Bharath Rupireddy (#20)
#22Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bruce Momjian (#21)
#23Bruce Momjian
bruce@momjian.us
In reply to: Bharath Rupireddy (#22)
#24Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bruce Momjian (#23)
#25Bruce Momjian
bruce@momjian.us
In reply to: Bharath Rupireddy (#24)
#26Andrey Borodin
amborodin@acm.org
In reply to: Bruce Momjian (#21)
#27Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#26)
#28Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andrey Borodin (#27)
#29Bruce Momjian
bruce@momjian.us
In reply to: Andrey Borodin (#27)
#30Bruce Momjian
bruce@momjian.us
In reply to: Bharath Rupireddy (#28)
#31Andrey Borodin
amborodin@acm.org
In reply to: Bruce Momjian (#30)
#32Bruce Momjian
bruce@momjian.us
In reply to: Andrey Borodin (#31)
#33SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: Bharath Rupireddy (#28)
#34Bruce Momjian
bruce@momjian.us
In reply to: SATYANARAYANA NARLAPURAM (#33)
#35SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: Bruce Momjian (#34)
#36SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: SATYANARAYANA NARLAPURAM (#35)
#37Andrey Borodin
amborodin@acm.org
In reply to: Bruce Momjian (#34)
#38SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: Andrey Borodin (#37)
#39SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: SATYANARAYANA NARLAPURAM (#38)
#40Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: SATYANARAYANA NARLAPURAM (#36)
#41Daniel Gustafsson
daniel@yesql.se
In reply to: Bharath Rupireddy (#40)
#42SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: Bharath Rupireddy (#40)