Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers
Hi,
It looks like the logical replication subscribers are receiving the
quorum uncommitted transactions even before the synchronous (sync)
standbys. Most of the time this is okay, but it can be a problem if the
primary goes down/crashes (while it is in SyncRepWaitForLSN)
before the quorum commit is achieved (i.e. before the sync standbys
receive the committed txns from the primary) and a failover to a
sync standby has to happen. The subscriber would have received the
quorum uncommitted txns whereas the sync standbys didn't. After the
failover, the new primary (the old sync standby) would be behind the
subscriber, i.e. the subscriber will be seeing data that the new
primary can't. Is there a way for the subscriber to get back in
sync with the new primary? In other words, can we reverse the effects
of the quorum uncommitted txns on the subscriber? The naive way is to do
it manually, but that doesn't seem elegant.
We have performed a small experiment to observe the above behaviour
with 1 primary, 1 sync standby and 1 subscriber:
1) Have a wait loop in SyncRepWaitForLSN (a temporary hack, sketched
after this list, to illustrate the standby receiving the txn a bit
late or failing to receive it)
2) Insert data into a table on the primary
3) The primary waits i.e. the insert query hangs (because of the wait
loop hack) before the local txn is sent to the sync standby,
whereas the subscriber receives the inserted data.
4) If the primary crashes/goes down and is unable to come up, and the
failover happens to the sync standby (which didn't receive the data that
got inserted on the primary), the subscriber would see data that
the sync standby can't.
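For reference, the temporary hack in step 1 could be as simple as the
fragment below. It is purely illustrative and not part of any patch: it
assumes a delay dropped near the top of SyncRepWaitForLSN() in
src/backend/replication/syncrep.c, so the committing backend stalls before
waiting for the sync standby's acknowledgement while the logical walsender
keeps streaming the already-flushed commit; the 30-second duration is an
arbitrary choice for the experiment.

/*
 * Illustrative-only delay: stall the committing backend inside
 * SyncRepWaitForLSN() before it starts waiting for the sync standby's
 * acknowledgement.  Meanwhile the logical walsender continues to decode
 * and ship the already-flushed commit to the subscriber.
 */
for (int i = 0; i < 30; i++)
{
    pg_usleep(1000L * 1000L);   /* one second per iteration */
    CHECK_FOR_INTERRUPTS();     /* keep the backend cancellable */
}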
This looks to be a problem. A possible solution is to let the
subscribers receive the txns only after the primary achieves quorum
commit (gets out of the SyncRepWaitForLSN or after all sync standbys
received the txns). The logical replication walsenders can wait until
the quorum commit is obtained and then can send the WAL. A new GUC can
be introduced to control this, default being the current behaviour.
Thoughts?
Thanks Satya (cc-ed) for the use-case and off-list discussion.
Regards,
Bharath Rupireddy.
Consider a cluster formation where we have a Primary(P), Sync Replica(S1),
and multiple async replicas for disaster recovery and read scaling (within
the region and outside the region). In this setup, S1 is the preferred
failover target in the event of a primary failure. When a transaction is
committed on the primary, it is not acknowledged to the client until the
primary gets an acknowledgment from the sync standby that the WAL is
flushed to disk (assume the synchronous_commit configuration is
remote_flush). However, the walsenders on the primary corresponding to the
async replicas don't wait for that flush acknowledgment and send
the WAL to the async standbys (and also to any logical replication/decoding
clients). So it is possible for the async replicas and logical clients to get
ahead of the sync replica. If a failover is initiated in such a scenario, to
bring the formation into a healthy state we have to either
1. run the pg_rewind on the async replicas for them to reconnect with
the new primary or
2. collect the latest WAL across the replicas and feed the standby.
Both these operations are involved, error prone, and can cause multiple
minutes of downtime if done manually. In addition, there is a window where
the async replicas can show data that was neither acknowledged to the
client nor committed on the standby. Logical clients, if they are ahead, may
need to reseed the data as there is no easy rewind option for them.
I would like to propose a GUC send_Wal_after_quorum_committed which, when
set to ON, makes the walsenders corresponding to async standbys and logical
replication workers wait until the LSN is quorum committed on the primary
before sending it to the standby. This not only simplifies the post-failover
steps but avoids unnecessary downtime for the async replicas. Thoughts?
Thanks,
Satya
On Wed, 2022-01-05 at 23:59 -0800, SATYANARAYANA NARLAPURAM wrote:
I would like to propose a GUC send_Wal_after_quorum_committed which
when set to ON, walsenders corresponds to async standbys and logical
replication workers wait until the LSN is quorum committed on the
primary before sending it to the standby. This not only simplifies
the post failover steps but avoids unnecessary downtime for the async
replicas. Thoughts?
Do we need a GUC? Or should we just always require that sync rep is
satisfied before sending to async replicas?
It feels like the sync quorum should always be ahead of the async
replicas. Unless I'm missing a use case, or there is some kind of
performance gotcha.
Regards,
Jeff Davis
On Thu, Jan 6, 2022 at 11:24 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2022-01-05 at 23:59 -0800, SATYANARAYANA NARLAPURAM wrote:
I would like to propose a GUC send_Wal_after_quorum_committed which
when set to ON, walsenders corresponds to async standbys and logical
replication workers wait until the LSN is quorum committed on the
primary before sending it to the standby. This not only simplifies
the post failover steps but avoids unnecessary downtime for the async
replicas. Thoughts?
Do we need a GUC? Or should we just always require that sync rep is
satisfied before sending to async replicas?
I proposed a GUC to not introduce a behavior change by default. I have no
strong opinion on having a GUC or making the proposed behavior default,
would love to get others' perspectives as well.
It feels like the sync quorum should always be ahead of the async
replicas. Unless I'm missing a use case, or there is some kind of
performance gotcha.
I couldn't think of a case that can cause serious performance issues but
will run some experiments on this and post the numbers.
At Thu, 6 Jan 2022 23:55:01 -0800, SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote in
On Thu, Jan 6, 2022 at 11:24 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2022-01-05 at 23:59 -0800, SATYANARAYANA NARLAPURAM wrote:
I would like to propose a GUC send_Wal_after_quorum_committed which
when set to ON, walsenders corresponds to async standbys and logical
replication workers wait until the LSN is quorum committed on the
primary before sending it to the standby. This not only simplifies
the post failover steps but avoids unnecessary downtime for the async
replicas. Thoughts?
Do we need a GUC? Or should we just always require that sync rep is
satisfied before sending to async replicas?
I proposed a GUC to not introduce a behavior change by default. I have no
strong opinion on having a GUC or making the proposed behavior default,
would love to get others' perspectives as well.
It feels like the sync quorum should always be ahead of the async
replicas. Unless I'm missing a use case, or there is some kind of
performance gotcha.
I couldn't think of a case that can cause serious performance issues but
will run some experiments on this and post the numbers.
I think Jeff is saying that "quorum commit" already by definition
means that all out-of-quorum standbys are behind the
quorum standbys. I agree with that in a dictionary sense. But I can
think of a case where the response from the top-runner standby
vanishes or gets stuck somewhere on the network for some reason. In that
case the primary happily checks quorum, ignoring the top-runner.
To avoid that misdecision, I can guess two possible "solutions".
One is to serialize WAL sending (which is of course unacceptable),
and another is to send WAL to all standbys at once and then make the
decision only after making sure replies have been received from all standbys
(which is no longer quorum commit, in another sense..)
So I'm afraid that there's no sensible solution to avoid the
hiding-forerunner problem on quorum commit.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Jan 7, 2022 at 12:27 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
I think Jeff is saying that "quorum commit" already by definition
means that all out-of-quorum standbys are behind the
quorum standbys. I agree with that in a dictionary sense. But I can
think of a case where the response from the top-runner standby
vanishes or gets stuck somewhere on the network for some reason. In that
case the primary happily checks quorum, ignoring the top-runner.
To avoid that misdecision, I can guess two possible "solutions".
One is to serialize WAL sending (which is of course unacceptable),
and another is to send WAL to all standbys at once and then make the
decision only after making sure replies have been received from all standbys
(which is no longer quorum commit, in another sense..)
There is no need to serialize sending the WAL among sync standbys. The only
serialization required is to send first to all the sync replicas and then to
the async replicas, if any. Once an LSN is quorum committed, no failover
subsystem initiates an automatic failover such that the LSN is lost (data
loss).
So I'm afraid that there's no sensible solution to avoid the
hiding-forerunner problem on quorum commit.
Could you elaborate on the problem here?
On 1/6/22, 11:25 PM, "Jeff Davis" <pgsql@j-davis.com> wrote:
On Wed, 2022-01-05 at 23:59 -0800, SATYANARAYANA NARLAPURAM wrote:
I would like to propose a GUC send_Wal_after_quorum_committed which
when set to ON, walsenders corresponds to async standbys and logical
replication workers wait until the LSN is quorum committed on the
primary before sending it to the standby. This not only simplifies
the post failover steps but avoids unnecessary downtime for the async
replicas. Thoughts?
Do we need a GUC? Or should we just always require that sync rep is
satisfied before sending to async replicas?
It feels like the sync quorum should always be ahead of the async
replicas. Unless I'm missing a use case, or there is some kind of
performance gotcha.
I don't have a strong opinion on whether there needs to be a GUC, but
+1 for the ability to enforce sync quorum before sending WAL to async
standbys. I think that would be a reasonable default behavior.
Nathan
On Fri, Jan 7, 2022 at 12:54 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2022-01-05 at 23:59 -0800, SATYANARAYANA NARLAPURAM wrote:
I would like to propose a GUC send_Wal_after_quorum_committed which
when set to ON, walsenders corresponds to async standbys and logical
replication workers wait until the LSN is quorum committed on the
primary before sending it to the standby. This not only simplifies
the post failover steps but avoids unnecessary downtime for the async
replicas. Thoughts?
Do we need a GUC? Or should we just always require that sync rep is
satisfied before sending to async replicas?
It feels like the sync quorum should always be ahead of the async
replicas. Unless I'm missing a use case, or there is some kind of
performance gotcha.
IMO, having a GUC is a reasonable choice because some users might be
okay with their async replicas being ahead of the sync ones, or
they may have dealt with this problem already in their HA solutions,
or they don't want their async replicas to fall behind the primary
(most of the time).
If there are long-running txns on the primary and the async standbys
were to wait until quorum commit from the sync standbys, won't they fall
behind the primary by too much? This isn't a problem at all if we
think from the perspective that async replicas are anyway prone to
falling behind the primary. But if the primary continuously has
long-running txns, the async replicas would eventually fall
behind more and more. Is there a way we can send the WAL records to
both sync and async replicas together, but the async replicas won't
apply those WAL records until the primary tells the standbys that quorum
commit is obtained? If the quorum commit isn't obtained by the
primary, the async replicas can skip applying the WAL records and
discard them.
Regards,
Bharath Rupireddy.
On Sat, 2022-01-08 at 00:13 +0530, Bharath Rupireddy wrote:
If there are long running txns on the primary and the async standbys
were to wait until quorum commit from sync standbys, won't they fall
behind the primary by too much?
No, because replication is based on LSNs, not transactions.
With the proposed change: an LSN can be replicated to all sync replicas
as soon as it's durable on the primary; and an LSN can be replicated to
all async replicas as soon as it's durable on the primary *and* the
sync rep quorum is satisfied.
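Expressed as code, a minimal sketch of that rule (not from any patch;
AsyncSendLimit and syncQuorumFlushLSN are illustrative names, the latter
standing in for WalSndCtl->lsn[SYNC_REP_WAIT_FLUSH] read under SyncRepLock)
would be:

/* Illustrative only: upper bound for WAL an async walsender may send. */
static XLogRecPtr
AsyncSendLimit(XLogRecPtr localFlushLSN, XLogRecPtr syncQuorumFlushLSN)
{
    /*
     * Sync standbys are bounded only by what is durable on the primary;
     * async standbys are additionally bounded by the satisfied sync quorum.
     */
    return Min(localFlushLSN, syncQuorumFlushLSN);
}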
Regards,
Jeff Davis
Hi,
On 2022-01-06 23:24:40 -0800, Jeff Davis wrote:
It feels like the sync quorum should always be ahead of the async
replicas. Unless I'm missing a use case, or there is some kind of
performance gotcha.
I don't see how it can *not* cause a major performance / latency
gotcha. You're deliberately delaying replication after all?
Synchronous replication doesn't guarantee *anything* about the ability to
fail over for other replicas. Nor would it after what's proposed here -
another sync replica would still not be guaranteed to be able to follow the
newly promoted primary.
To me this just sounds like trying to shoehorn something into syncrep that
it's not made for.
Greetings,
Andres Freund
On Fri, 2022-01-07 at 12:22 -0800, Andres Freund wrote:
I don't see how it can *not* cause a major performance / latency
gotcha. You're deliberately delaying replication after all?
Are there use cases where someone wants sync rep, and also wants their
read replicas to get ahead of the sync rep quorum?
If the use case doesn't exist, it doesn't make sense to talk about how
well it performs.
another sync replica would still not be guaranteed to be able to
follow the
newly promoted primary.
If you only promote the furthest-ahead sync replica (which is what you
should be doing if you have quorum commit), why wouldn't that work?
To me this just sounds like trying to shoehorn something into syncrep
that
it's not made for.
What *is* sync rep made for?
The only justification in the docs is around durability:
"[sync rep] extends that standard level of durability offered by a
transaction commit... [sync rep] can provide a much higher level of
durability..."
If we take that at face value, then it seems logical to say that async
read replicas should not get ahead of sync replicas.
Regards,
Jeff Davis
Hi,
On 2022-01-07 14:36:46 -0800, Jeff Davis wrote:
On Fri, 2022-01-07 at 12:22 -0800, Andres Freund wrote:
I don't see how it can *not* cause a major performance / latency
gotcha. You're deliberately delaying replication after all?
Are there use cases where someone wants sync rep, and also wants their
read replicas to get ahead of the sync rep quorum?
Yes. Not in the sense of being ahead of the sync replicas, but in the sense of
being as caught up as possible, and to keep the lost WAL in case of crashes as
low as possible.
another sync replica would still not be guaranteed to be able to
follow the
newly promoted primary.
If you only promote the furthest-ahead sync replica (which is what you
should be doing if you have quorum commit), why wouldn't that work?
Remove "sync" from the above sentence, and the sentence holds true for
combinations of sync/async replicas as well.
To me this just sounds like trying to shoehorn something into syncrep
that
it's not made for.
What *is* sync rep made for?
The only justification in the docs is around durability:
"[sync rep] extends that standard level of durability offered by a
transaction commit... [sync rep] can provide a much higher level of
durability..."
What is being proposed here doesn't increase durability. It *reduces* it -
it's less likely that WAL is replicated before a crash.
This is especially relevant in cases where synchronous_commit=on vs local is
used selectively - after this change the durability of local changes is very
substantially *reduced* because they have to wait for the sync replicas before
also being replicated to async replicas, but the COMMIT doesn't wait for
replication. So this "feature" just reduces the durability of such commits.
The performance overhead of syncrep is high enough that plenty of real-world
usages cannot afford to use it for all transactions. And that's normally fine
from a business logic POV - often the majority of changes aren't that
important. It's non-trivial from an application implementation POV, but
that's imo a separate concern.
If we take that at face value, then it seems logical to say that async
read replicas should not get ahead of sync replicas.
I don't see that. This presumes that WAL replicated to async replicas is
somehow bad. But pg_rewind exists, async replicas can be promoted, and WAL from
the async replicas can be transferred to the synchronous replicas if only
those should be promoted.
Greetings,
Andres Freund
On Fri, 2022-01-07 at 14:54 -0800, Andres Freund wrote:
If you only promote the furthest-ahead sync replica (which is what
you
should be doing if you have quorum commit), why wouldn't that work?
Remove "sync" from the above sentence, and the sentence holds true
for
combinations of sync/async replicas as well.
Technically that's true, but it seems like a bit of a strange use case.
I would think people doing that would just include those async replicas
in the sync quorum instead.
The main case I can think of for a mix of sync and async replicas is
if they are just managed differently. For instance, the sync replica
quorum is managed for a core part of the system, strategically
allocated on good hardware in different locations to minimize the
chance of dependent failures; while the async read replicas are
optional for taking load off the primary, and may appear/disappear in
whatever location and on whatever hardware is most convenient.
But if an async replica can get ahead of the sync rep quorum, then the
most recent transactions can appear in query results, so that means the
WAL shouldn't be lost, and the async read replicas become a part of the
durability model.
If the async read replica can't be promoted because it's not suitable
(due to location, hardware, whatever), then you need to frantically
copy the final WAL records out to an instance in the sync rep quorum.
That requires extra ceremony for every failover, and might be dubious
depending on how safe the WAL on your async read replicas is, and
whether there are dependent failure risks.
Yeah, I guess there could be some use case woven amongst those caveats,
but I'm not sure if anyone is actually doing that combination of things
safely today. If someone is, it would be interesting to know more about
that use case.
The proposal in this thread is quite a bit simpler: manage your sync
quorum and your async read replicas separately, and keep the sync rep
quorum ahead.
To me this just sounds like trying to shoehorn something into
syncrep
that
it's not made for.
What *is* sync rep made for?
This was a sincere question and an answer would be helpful. I think
many of the discussions about sync rep get derailed because people have
different ideas about when and how it should be used, and the
documentation is pretty light.
This is especially relevant in cases where synchronous_commit=on vs
local is
used selectively
That's an interesting point.
However, it's hard for me to reason about "kinda durable" and "a little
more durable" and I'm not sure how many people would care about that
distinction.
I don't see that. This presumes that WAL replicated to async replicas
is
somehow bad.
Simple case: primary and async read replica are in the same server
rack. Sync replicas are geographically distributed with quorum commit.
Read replica gets the WAL first (because it's closest), starts
answering queries that include that WAL, and then the entire rack
catches fire. Now you've returned results to the client, but lost the
transactions.
Regards,
Jeff Davis
On Fri, Jan 7, 2022 at 4:52 PM Jeff Davis <pgsql@j-davis.com> wrote:
If the async read replica can't be promoted because it's not suitable
(due to location, hardware, whatever), then you need to frantically
copy the final WAL records out to an instance in the sync rep quorum.
That requires extra ceremony for every failover, and might be dubious
depending on how safe the WAL on your async read replicas is, and
whether there are dependent failure risks.
This may not even be possible always as described in the scenario below.
Simple case: primary and async read replica are in the same server
rack. Sync replicas are geographically distributed with quorum commit.
Read replica gets the WAL first (because it's closest), starts
answering queries that include that WAL, and then the entire rack
catches fire. Now you've returned results to the client, but lost the
transactions.
Another similar example: in a multi-AZ HA setup, the primary and sync
replicas are deployed in two different availability zones, the async
replicas for reads can be in any availability zone, and assume an async
replica and the primary land in the same AZ. The primary availability zone
going down leads to both the primary and the async replica going down at the
same time. This async replica could be ahead of the sync replica, and the WAL
can't be collected as both the primary and the async replica failed together.
At Fri, 7 Jan 2022 09:44:15 -0800, SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote in
There is no need to serialize sending the WAL among sync standbys. The only
serialization required is to send first to all the sync replicas and then to
the async replicas, if any. Once an LSN is quorum committed, no failover
subsystem initiates an automatic failover such that the LSN is lost (data
loss).
Sync standbys in PostgreSQL are ex post facto. When a certain set of
standbys has first reported catching up for a commit, they are called
"sync standbys".
We can maintain a fixed set of sync standbys based on the set of
sync standbys at past commits, but that implies performance
degradation even if not a single standby is gone.
If we send WAL only to the fixed set of sync standbys, then when any of the
standbys is gone, the primary is forced to wait until some timeout
expires. The same commit would finish immediately if WAL had been
sent also to out-of-quorum standbys.
So I'm afraid that there's no sensible solution to avoid the
hiding-forerunner problem on quorum commit.
Could you elaborate on the problem here?
If a primary has received responses for LSN=X from N standbys, that
fact doesn't guarantee that none of the other standbys has reached the
same LSN. If one of the yet-unresponded standbys has already reached
LSN=X+10 but its response has not arrived at the primary for some
reason, the truly fastest standby is hidden from the primary.
Even if the primary examines the responses from all standbys, it is
uncertain whether the responses reflect the truly current state of the
standbys. Thus, if we want to guarantee that no unresponded standby is
going beyond LSN=X, there's no means other than refraining from
sending WAL beyond X. In that case, we need to serialize the period
from WAL sending to response reception, which would lead to critical
performance degradation.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Jan 6, 2022 at 1:29 PM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
Thanks Satya and others for the inputs. Here's the v1 patch that
basically allows async wal senders to wait until the sync standbys
report their flush lsn back to the primary. Please let me know your
thoughts.
I've done pgbench testing to see if the patch causes any problems. I
ran tests two times, there isn't much difference in the txns per
seconds (tps), although there's a delay in the async standby receiving
the WAL, after all, that's the feature we are pursuing.
HEAD or WITHOUT PATCH:
./pgbench -c 10 -t 500 -P 10 testdb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 500
number of transactions actually processed: 5000/5000
latency average = 247.395 ms
latency stddev = 74.409 ms
initial connection time = 13.622 ms
tps = 39.713114 (without initial connection time)
PATCH:
./pgbench -c 10 -t 500 -P 10 testdb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 500
number of transactions actually processed: 5000/5000
latency average = 251.757 ms
latency stddev = 72.846 ms
initial connection time = 13.025 ms
tps = 39.315862 (without initial connection time)
TEST SETUP:
primary in region 1
async standby 1 in the same region as that of the primary region 1
i.e. close to primary
sync standby 1 in region 2
sync standby 2 in region 3
an archive location in a region different from the primary and
standbys regions, region 4
Note that I intentionally kept sync standbys in regions far from
primary because it allows sync standbys to receive WAL a bit late by
default, which works well for our testing.
PGBENCH SETUP:
./psql -d postgres -c "drop database testdb"
./psql -d postgres -c "create database testdb"
./pgbench -i -s 100 testdb
./psql -d testdb -c "\dt"
./psql -d testdb -c "SELECT pg_size_pretty(pg_database_size('testdb'))"
./pgbench -c 10 -t 500 -P 10 testdb
Regards,
Bharath Rupireddy.
Attachments:
v1-0001-Allow-async-standbys-wait-for-sync-replication.patch (application/octet-stream)
From b7abfc772ce1b0ce66a0854241a290cf3beb91eb Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 15 Feb 2022 14:56:33 +0000
Subject: [PATCH v1] Allow async standbys wait for sync replication
---
doc/src/sgml/config.sgml | 24 +++
doc/src/sgml/monitoring.sgml | 4 +
src/backend/replication/walsender.c | 155 ++++++++++++++++++
src/backend/utils/activity/wait_event.c | 3 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 3 +
src/include/replication/walsender.h | 1 +
src/include/utils/wait_event.h | 3 +-
8 files changed, 201 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 53b361e7a9..63e8c226cb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4245,6 +4245,30 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</listitem>
</varlistentry>
+ <varlistentry id="guc-async-standbys-wait-for-sync_replication" xreflabel="async_standbys_wait_for_sync_replication">
+ <term><varname>async_standbys_wait_for_sync_replication</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>async_standbys_wait_for_sync_replication</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When set, the primary will never let asynchronous standbys get ahead
+ of the synchronous standbys. In other words, the asynchronous standbys
+ will wait for synchronous replication, i.e., until at least the
+ synchronous standbys (either quorum or priority based) receive the WAL,
+ flush it and acknowledge the primary with the flush LSN. This behaviour
+ is particularly important as it avoids the extra manual steps otherwise
+ required on asynchronous standbys which are ahead of the synchronous
+ standbys in the event of failover.
+ </para>
+ <para>
+ This parameter can only be set in the <filename>postgresql.conf</filename>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 62f2a3332b..52d58d79c4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1170,6 +1170,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</thead>
<tbody>
+ <row>
+ <entry><literal>AsyncWalSenderWaitForSyncReplication</literal></entry>
+ <entry>Asynchronous standby waiting in WAL sender, before sending WAL, for synchronous standbys to flush the WAL.</entry>
+ </row>
<row>
<entry><literal>ClientRead</literal></entry>
<entry>Waiting to read data from the client.</entry>
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 655760fee3..79920f0212 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -124,6 +124,8 @@ int wal_sender_timeout = 60 * 1000; /* maximum time to send one WAL
* data message */
bool log_replication_commands = false;
+bool async_standbys_wait_for_sync_replication = true;
+
/*
* State for WalSndWakeupRequest
*/
@@ -257,6 +259,9 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
TimeLineID *tli_p);
+static bool AmAsyncStandbyWALSender(void);
+static bool ShouldWaitForSyncRepl(void);
+static void WaitForSyncRepl(XLogRecPtr sendRqstPtr);
/* Initialize walsender process before entering the main command loop */
void
@@ -2836,6 +2841,8 @@ XLogSendPhysical(void)
return;
}
+ WaitForSyncRepl(SendRqstPtr);
+
/*
* Figure out how much to send in one message. If there's no more than
* MAX_SEND_SIZE bytes to send, send everything. Otherwise send
@@ -2994,6 +3001,22 @@ XLogSendLogical(void)
if (record != NULL)
{
+ /*
+ * At this point, we do not know whether the current LSN (ReadRecPtr)
+ * is required by any of the logical decoding output plugins which is
+ * only known at the plugin level. If we were to decide whether to wait
+ * or not for the synchronous standbys flush LSN at the plugin level,
+ * we might have to pass extra information to it which doesn't sound an
+ * elegant way.
+ *
+ * Another way the output plugins can wait there before sending the WAL
+ * is by reading the flush LSN from the logical replication slots.
+ *
+ * Waiting here i.e. before even the logical decoding kicks in, makes
+ * the code clean.
+ */
+ WaitForSyncRepl(logical_decoding_ctx->reader->ReadRecPtr);
+
/*
* Note the lack of any call to LagTrackerWrite() which is handled by
* WalSndUpdateProgress which is called by output plugin through
@@ -3816,3 +3839,135 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
Assert(time != 0);
return now - time;
}
+
+/*
+ * Check if the WAL sender is serving an asynchronous standby at the moment.
+ */
+static bool
+AmAsyncStandbyWALSender(void)
+{
+ int priority;
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+ priority = MyWalSnd->sync_standby_priority;
+ SpinLockRelease(&MyWalSnd->mutex);
+
+ Assert(priority >= 0);
+
+ if (priority > 0)
+ ereport(DEBUG3,
+ (errmsg("WAL sender is serving a sync standby at the moment with priority %d",
+ priority)));
+ else if (priority == 0)
+ ereport(DEBUG3,
+ (errmsg("WAL sender is serving an async standby at the moment with priority %d",
+ priority)));
+
+ return priority == 0 ? true : false;
+}
+
+/*
+ * Check if the WAL sender should wait for synchronous replication.
+ */
+static bool
+ShouldWaitForSyncRepl(void)
+{
+ /* GUC is set to off, so no need to wait */
+ if (!async_standbys_wait_for_sync_replication)
+ return false;
+
+ /* Synchronous replication is not requested, so no need to wait */
+ if (SyncRepStandbyNames == NULL ||
+ SyncRepStandbyNames[0] == '\0' ||
+ SyncRepConfig == NULL)
+ return false;
+
+ /*
+ * WAL sender is serving a sync standby at the moment, so no need to wait.
+ */
+ if (!AmAsyncStandbyWALSender())
+ return false;
+
+ /*
+ * WAL sender is serving an async standby at the moment, so it should wait.
+ */
+ return true;
+}
+
+/*
+ * Wait until the sendRqstPtr is at least the flush LSN of all synchronous
+ * standbys.
+ */
+static void
+WaitForSyncRepl(XLogRecPtr sendRqstPtr)
+{
+ XLogRecPtr flushLSN;
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+
+ Assert(walsndctl != NULL);
+ Assert(!XLogRecPtrIsInvalid(sendRqstPtr));
+
+ if (!ShouldWaitForSyncRepl())
+ return;
+
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ flushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
+ LWLockRelease(SyncRepLock);
+
+ /*
+ * This WAL sender is allowed to send the WAL as its request LSN is equal
+ * or behind the sync standbys flush LSN.
+ */
+ if (sendRqstPtr <= flushLSN)
+ {
+ ereport(DEBUG3,
+ (errmsg("async standby WAL sender is allowed to send WAL as its request LSN %X/%X is equal or behind sync standbys flush LSN %X/%X",
+ LSN_FORMAT_ARGS(sendRqstPtr), LSN_FORMAT_ARGS(flushLSN)),
+ errhidestmt(true)));
+
+ return;
+ }
+
+ ereport(DEBUG3,
+ (errmsg("async standby WAL sender with request LSN %X/%X is waiting as sync standbys are ahead with flush LSN %X/%X",
+ LSN_FORMAT_ARGS(flushLSN), LSN_FORMAT_ARGS(sendRqstPtr)),
+ errhidestmt(true)));
+
+ for (;;)
+ {
+ /*
+ * It is enough to wait until the flush LSN of sync standbys as it
+ * guarantees local durable commit, standby durable commit after both
+ * server and OS crash, thus no transaction losses.
+ *
+ * XXX: we could go further to make this optional with
+ * synchronous_commit setting. That is, the async standbys can be made
+ * to wait either for remote write or flush or apply LSN. But waiting
+ * for remote flush LSN suffices to give maximum protection for us.
+ */
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ flushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
+ LWLockRelease(SyncRepLock);
+
+ if (sendRqstPtr <= flushLSN)
+ break;
+
+ /*
+ * Wait for a brief moment i.e. 10 msec before trying again.
+ * Intentionally chose a shorter wait time as we don't want to wait
+ * longer durations and be aggressive in reading the flush LSN.
+ */
+ WalSndWait(WL_SOCKET_WRITEABLE | WL_SOCKET_READABLE, 10L,
+ WAIT_EVENT_ASYNC_WAL_SENDER_WAIT_FOR_SYNC_REPLICATION);
+
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ ereport(DEBUG3,
+ (errmsg("async standby WAL sender with request LSN %X/%X can now send WAL up to sync standbys flush LSN %X/%X",
+ LSN_FORMAT_ARGS(sendRqstPtr), LSN_FORMAT_ARGS(flushLSN)),
+ errhidestmt(true)));
+}
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 60972c3a75..745c3976a8 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -267,6 +267,9 @@ pgstat_get_wait_client(WaitEventClient w)
switch (w)
{
+ case WAIT_EVENT_ASYNC_WAL_SENDER_WAIT_FOR_SYNC_REPLICATION:
+ event_name = "AsyncWalSenderWaitForSyncReplication";
+ break;
case WAIT_EVENT_CLIENT_READ:
event_name = "ClientRead";
break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e2fe219aa8..a4050ba8d5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1381,6 +1381,15 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+ {
+ {"async_standbys_wait_for_sync_replication", PGC_SIGHUP, REPLICATION_SENDING,
+ gettext_noop("Sets whether asynchronous standbys should wait until synchronous standbys receive and flush WAL."),
+ NULL
+ },
+ &async_standbys_wait_for_sync_replication,
+ true,
+ NULL, NULL, NULL
+ },
{
{"debug_assertions", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows whether the running server has assertion checks enabled."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 56d0bee6d9..f9de74d4e8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -308,6 +308,9 @@
#wal_sender_timeout = 60s # in milliseconds; 0 disables
#track_commit_timestamp = off # collect timestamp of transaction commit
# (change requires restart)
+#async_standbys_wait_for_sync_replication = on # asynchronous standbys wait
+ # until synchronous standbys receive, flush WAL and
+ # confirm primary
# - Primary Server -
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index b1892e9e4b..61b113ae48 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -34,6 +34,7 @@ extern bool wake_wal_senders;
extern int max_wal_senders;
extern int wal_sender_timeout;
extern bool log_replication_commands;
+extern bool async_standbys_wait_for_sync_replication;
extern void InitWalSender(void);
extern bool exec_replication_command(const char *query_string);
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 395d325c5f..fbe2b66368 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -60,7 +60,8 @@ typedef enum
*/
typedef enum
{
- WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
+ WAIT_EVENT_ASYNC_WAL_SENDER_WAIT_FOR_SYNC_REPLICATION = PG_WAIT_CLIENT,
+ WAIT_EVENT_CLIENT_READ,
WAIT_EVENT_CLIENT_WRITE,
WAIT_EVENT_GSS_OPEN_SERVER,
WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
--
2.25.1
On Fri, Feb 25, 2022 at 08:31:37PM +0530, Bharath Rupireddy wrote:
Thanks Satya and others for the inputs. Here's the v1 patch that
basically allows async wal senders to wait until the sync standbys
report their flush lsn back to the primary. Please let me know your
thoughts.
I haven't had a chance to look too closely yet, but IIUC this adds a new
function that waits for synchronous replication. This new function
essentially spins until the synchronous LSN has advanced.
I don't think it's a good idea to block sending any WAL like this. AFAICT
it is possible that there will be a lot of synchronously replicated WAL
that we can send, and it might just be the last several bytes that cannot
yet be replicated to the asynchronous standbys. I believe this patch will
cause the server to avoid sending _any_ WAL until the synchronous LSN
advances.
Perhaps we should instead just choose the SendRqstPtr based on the current
synchronous LSN. Presumably there are other things we'd need to consider,
but in general, I think we ought to send as much WAL as possible for a
given call to XLogSendPhysical().
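As a rough sketch of that alternative, and not a drop-in change:
XLogSendPhysical() could cap how far a single call sends, so that already
quorum-flushed WAL still flows to async standbys. GetSyncQuorumFlushLSN()
below is a hypothetical helper standing in for reading
WalSndCtl->lsn[SYNC_REP_WAIT_FLUSH] under SyncRepLock, and
ShouldWaitForSyncRepl() is the check from the v1 patch.

/* Hypothetical variant: clamp the send request pointer instead of waiting. */
if (ShouldWaitForSyncRepl())
{
    XLogRecPtr  syncFlushPtr = GetSyncQuorumFlushLSN();    /* hypothetical */

    /* Send everything that is already quorum-flushed, but no further. */
    if (SendRqstPtr > syncFlushPtr)
        SendRqstPtr = syncFlushPtr;
}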
I've done pgbench testing to see if the patch causes any problems. I
ran tests two times, there isn't much difference in the txns per
seconds (tps), although there's a delay in the async standby receiving
the WAL, after all, that's the feature we are pursuing.
I'm curious what a longer pgbench run looks like when the synchronous
replicas are in the same region. That is probably a more realistic
use-case.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hello,
On 2/25/22 11:38 AM, Nathan Bossart wrote:
I haven't had a chance to look too closely yet, but IIUC this adds a new
function that waits for synchronous replication. This new function
essentially spins until the synchronous LSN has advanced.
I don't think it's a good idea to block sending any WAL like this. AFAICT
it is possible that there will be a lot of synchronously replicated WAL
that we can send, and it might just be the last several bytes that cannot
yet be replicated to the asynchronous standbys. I believe this patch will
cause the server to avoid sending _any_ WAL until the synchronous LSN
advances.
Perhaps we should instead just choose the SendRqstPtr based on the current
synchronous LSN. Presumably there are other things we'd need to consider,
but in general, I think we ought to send as much WAL as possible for a
given call to XLogSendPhysical().
I think you're right that we'll avoid sending any WAL until sync_lsn
advances. We could set up a contrived situation where the async-walsender
never advances because it terminates before the flush_lsn of the
synchronous_node catches up. And when the async-walsender restarts,
it'll start with the latest flushed LSN on the primary and we could go into
a perpetual loop.
I took a look at the patch and tested basic streaming with async
replicas ahead of the synchronous standby and with logical clients as
well and it works as expected.
ereport(LOG,
(errmsg("async standby WAL sender with request LSN %X/%X
is waiting as sync standbys are ahead with flush LSN %X/%X",
LSN_FORMAT_ARGS(flushLSN),
LSN_FORMAT_ARGS(sendRqstPtr)),
errhidestmt(true)));
I think this log formatting is incorrect.
s/sync standbys are ahead/sync standbys are behind/ and I think you need
to swap flushLsn and sendRqstPtr
When a walsender is waiting for the lsn on the synchronous replica to
advance and a database stop is issued to the writer, the pg_ctl stop
isn't able to proceed and the database seems to never shut down.
Assert(priority >= 0);
What's the point of the assert here?
Also the comments/code refer to AsyncStandbys, however it's also used
for logical clients, which may or may not be standbys. Don't feel too
strongly about the naming here but something to note.
if (!ShouldWaitForSyncRepl())
return;
...
for (;;)
{
// rest of work
}
If we had a walsender already waiting for an ack, and the conditions of
ShouldWaitForSyncRepl() change, such as disabling
async_standbys_wait_for_sync_replication or synchronous replication,
it'll still wait since we never re-check the condition.
postgres=# select wait_event from pg_stat_activity where wait_event like
'AsyncWal%';
wait_event
--------------------------------------
AsyncWalSenderWaitForSyncReplication
AsyncWalSenderWaitForSyncReplication
AsyncWalSenderWaitForSyncReplication
(3 rows)
postgres=# show synchronous_standby_names;
synchronous_standby_names
---------------------------
(1 row)
postgres=# show async_standbys_wait_for_sync_replication;
async_standbys_wait_for_sync_replication
------------------------------------------
off
(1 row)
LWLockAcquire(SyncRepLock, LW_SHARED);
flushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
LWLockRelease(SyncRepLock);
Should we configure this similar to the user's setting of
synchronous_commit instead of just flush? (SYNC_REP_WAIT_WRITE,
SYNC_REP_WAIT_APPLY)
Thanks,
John H
On Sat, Feb 26, 2022 at 1:08 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
I haven't had a chance to look too closely yet, but IIUC this adds a new
function that waits for synchronous replication. This new function
essentially spins until the synchronous LSN has advanced.
I don't think it's a good idea to block sending any WAL like this. AFAICT
it is possible that there will be a lot of synchronously replicated WAL
that we can send, and it might just be the last several bytes that cannot
yet be replicated to the asynchronous standbys. I believe this patch will
cause the server to avoid sending _any_ WAL until the synchronous LSN
advances.
Perhaps we should instead just choose the SendRqstPtr based on the current
synchronous LSN. Presumably there are other things we'd need to consider,
but in general, I think we ought to send as much WAL as possible for a
given call to XLogSendPhysical().
A global min LSN of SendRqstPtr of all the sync standbys can be
calculated and the async standbys can send WAL up to global min LSN.
This is unlike what the v1 patch does i.e. async standbys will wait
until the sync standbys report flush LSN back to the primary. Problem
with the global min LSN approach is that there can still be a small
window where async standbys can get ahead of sync standbys. Imagine
async standbys being closer to the primary than the sync standbys, and the
failover having to happen while the WAL at SendRqstPtr isn't yet received
by the sync standbys but has been received by the async standbys because they
are closer. We hit the same problem that we are trying to solve with
this patch. This is the reason we are waiting for the sync flush LSN,
as it guarantees more transactional protection.
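For illustration only (not part of the posted patch), the "global min LSN of
SendRqstPtr" idea could be computed against walsender.c's shared state roughly
as below, taking the minimum sentPtr among walsenders currently serving sync
standbys:

/* Illustrative only: minimum position already sent to any sync standby. */
static XLogRecPtr
MinSentPtrOfSyncWalSenders(void)
{
    XLogRecPtr  minSent = InvalidXLogRecPtr;

    for (int i = 0; i < max_wal_senders; i++)
    {
        WalSnd     *walsnd = &WalSndCtl->walsnds[i];
        pid_t       pid;
        int         priority;
        XLogRecPtr  sent;

        SpinLockAcquire(&walsnd->mutex);
        pid = walsnd->pid;
        sent = walsnd->sentPtr;
        priority = walsnd->sync_standby_priority;
        SpinLockRelease(&walsnd->mutex);

        if (pid == 0 || priority == 0)
            continue;           /* unused slot or an async walsender */

        if (XLogRecPtrIsInvalid(minSent) || sent < minSent)
            minSent = sent;
    }

    return minSent;
}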
Do you think we should allow async standbys to optionally wait for either
remote write or flush or apply or the global min LSN of SendRqstPtr, so that
users can choose what they want?
I've done pgbench testing to see if the patch causes any problems. I
ran tests two times, there isn't much difference in the txns per
seconds (tps), although there's a delay in the async standby receiving
the WAL, after all, that's the feature we are pursuing.
I'm curious what a longer pgbench run looks like when the synchronous
replicas are in the same region. That is probably a more realistic
use-case.
We are performing more tests, I will share the results once done.
Regards,
Bharath Rupireddy.
On Sat, Feb 26, 2022 at 3:22 AM Hsu, John <hsuchen@amazon.com> wrote:
I think you're right that we'll avoid sending any WAL until sync_lsn
advances. We could set up a contrived situation where the async-walsender
never advances because it terminates before the flush_lsn of the
synchronous_node catches up. And when the async-walsender restarts,
it'll start with the latest flushed LSN on the primary and we could go into
a perpetual loop.
The async walsender looks at the flush LSN from
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] after it comes up and decides to
send the WAL up to it. If there are no sync replicas after it comes
up (users can make sync standbys async without a postmaster restart
because synchronous_standby_names is effective with SIGHUP), then it
doesn't wait at all and continues to send WAL. I don't see any problem
with it. Am I missing something here?
I took a look at the patch and tested basic streaming with async
replicas ahead of the synchronous standby and with logical clients as
well and it works as expected.
Thanks for reviewing and testing the patch.
ereport(LOG,
(errmsg("async standby WAL sender with request LSN %X/%Xis waiting as sync standbys are ahead with flush LSN %X/%X",
LSN_FORMAT_ARGS(flushLSN),
LSN_FORMAT_ARGS(sendRqstPtr)),
errhidestmt(true)));
I think this log formatting is incorrect.
s/sync standbys are ahead/sync standbys are behind/ and I think you need
to swap flushLsn and sendRqstPtr
I will correct it. "async standby WAL sender with request LSN %X/%X is
waiting as sync standbys are ahead with flush LSN %X/%X",
LSN_FORMAT_ARGS(sendRqstP), LSN_FORMAT_ARGS(flushLSN). I will think
more about having better wording of these messages, any suggestions
here?
When a walsender is waiting for the lsn on the synchronous replica to
advance and a database stop is issued to the writer, the pg_ctl stop
isn't able to proceed and the database seems to never shutdown.
I too observed this once or twice. It looks like the walsender isn't
detecting postmaster death in the for (;;) loop with WalSndWait. Not sure if
this is expected or true with other wait-loops in walsender code. Any
more thoughts here?
Assert(priority >= 0);
What's the point of the assert here?
Just for safety. I can remove it as the sync_standby_priority can
never be negative.
Also the comments/code refer to AsyncStandbys, however it's also used
for logical clients, which may or may not be standbys. Don't feel too
strongly about the naming here but something to note.
I will try to be more informative by adding something like "async
standbys and logical replication subscribers".
if (!ShouldWaitForSyncRepl())
return;
...
for (;;)
{
// rest of work
}
If we had a walsender already waiting for an ack, and the conditions of
ShouldWaitForSyncRepl() change, such as disabling
async_standbys_wait_for_sync_replication or synchronous replication
it'll still wait since we never re-check the condition.
Yeah, I will add the checks inside the async walsender wait-loop.
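For illustration, a minimal sketch of re-checking on every iteration of that wait loop (ShouldWaitForSyncRepl() is the patch's helper; where exactly the re-check sits is an assumption on my side):
/*
 * Sketch only: re-evaluate the wait condition each time around the loop,
 * since async_standbys_wait_for_sync_replication and
 * synchronous_standby_names are PGC_SIGHUP and can change while we wait.
 */
for (;;)
{
    if (ConfigReloadPending)
    {
        ConfigReloadPending = false;
        ProcessConfigFile(PGC_SIGHUP);
    }

    if (!ShouldWaitForSyncRepl())
        break;              /* conditions changed; stop waiting */

    /* ... otherwise wait for the sync flush LSN to advance, then re-check ... */
}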
postgres=# select wait_event from pg_stat_activity where wait_event like
'AsyncWal%';
wait_event
--------------------------------------
AsyncWalSenderWaitForSyncReplication
AsyncWalSenderWaitForSyncReplication
AsyncWalSenderWaitForSyncReplication
(3 rows)
postgres=# show synchronous_standby_names;
 synchronous_standby_names
---------------------------

(1 row)
postgres=# show async_standbys_wait_for_sync_replication;
 async_standbys_wait_for_sync_replication
------------------------------------------
 off
(1 row)
LWLockAcquire(SyncRepLock, LW_SHARED);
flushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
LWLockRelease(SyncRepLock);
Should we configure this similar to the user's setting of
synchronous_commit instead of just flush? (SYNC_REP_WAIT_WRITE,
SYNC_REP_WAIT_APPLY)
As I said upthread, we can allow async standbys optionally wait for
either remote write or flush or apply or global min LSN of SendRqstPtr
so that users can choose what they want. I'm open to more thoughts
here.
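Purely as a sketch of that option (the GUC async_standby_wait_level and its enum values below are hypothetical and not part of any posted patch; only the SYNC_REP_WAIT_* constants are existing symbols):
/* Hypothetical mapping from an assumed GUC to the syncrep LSN slot to wait on. */
static int
AsyncStandbyWaitMode(void)
{
    switch (async_standby_wait_level)    /* assumed enum GUC */
    {
        case ASYNC_STANDBY_WAIT_WRITE:
            return SYNC_REP_WAIT_WRITE;
        case ASYNC_STANDBY_WAIT_APPLY:
            return SYNC_REP_WAIT_APPLY;
        case ASYNC_STANDBY_WAIT_FLUSH:
        default:
            return SYNC_REP_WAIT_FLUSH;
    }
}
The chosen mode would then index walsndctl->lsn[] the same way SYNC_REP_WAIT_FLUSH is used elsewhere in this thread.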
Regards,
Bharath Rupireddy.
On Sat, Feb 26, 2022 at 02:17:50PM +0530, Bharath Rupireddy wrote:
A global min LSN of SendRqstPtr of all the sync standbys can be
calculated and the async standbys can send WAL up to global min LSN.
This is unlike what the v1 patch does i.e. async standbys will wait
until the sync standbys report flush LSN back to the primary. Problem
with the global min LSN approach is that there can still be a small
window where async standbys can get ahead of sync standbys. Imagine
async standbys being closer to the primary than sync standbys and if
the failover has to happen while the WAL at SendRqstPtr isn't received
by the sync standbys, but the async standbys can receive them as they
are closer. We hit the same problem that we are trying to solve with
this patch. This is the reason, we are waiting till the sync flush LSN
as it guarantees more transactional protection.
Do you mean that the application of WAL gets ahead on your async standbys
or that the writing/flushing of WAL gets ahead? If synchronous_commit is
set to 'remote_write' or 'on', I think either approach can lead to
situations where the async standbys are ahead of the sync standbys with WAL
application. For example, a conflict between WAL replay and a query on
your sync standby could delay WAL replay, but the primary will not wait for
this conflict to resolve before considering a transaction synchronously
replicated and sending it to the async standbys.
If writing/flushing WAL gets ahead on async standbys, I think something is
wrong with the patch. If you aren't sending WAL to async standbys until
it is synchronously replicated to the sync standbys, it should by
definition be impossible for this to happen.
If you wanted to make sure that WAL was not applied to async standbys
before it was applied to sync standbys, I think you'd need to set
synchronous_commit to 'remote_apply'. This would ensure that the WAL is
replayed on sync standbys before the primary considers the transaction
synchronously replicated and sends it to the async standbys.
Do you think allowing async standbys optionally wait for either remote
write or flush or apply or global min LSN of SendRqstPtr so that users
can choose what they want?
I'm not sure I follow the difference between "global min LSN of
SendRqstPtr" and remote write/flush/apply. IIUC you are saying that we
could use the LSN of what is being sent to sync standbys instead of the LSN
of what the primary considers synchronously replicated. I don't think we
should do that because it provides no guarantee that the WAL has even been
sent to the sync standbys before it is sent to the async standbys. For
this feature, I think we always need to consider what the primary considers
synchronously replicated. My suggested approach doesn't change that. I'm
saying that instead of spinning in a loop waiting for the WAL to be
synchronously replicated, we just immediately send WAL up to the LSN that
is presently known to be synchronously replicated.
You do bring up an interesting point, though. Is there a use-case for
specifying synchronous_commit='on' but not sending WAL to async replicas
until it is synchronously applied? Or alternatively, would anyone want to
set synchronous_commit='remote_apply' but send WAL to async standbys as
soon as it is written to the sync standbys? My initial reaction is that we
should depend on the synchronous replication setup. As long as the primary
considers an LSN synchronously replicated, it would be okay to send it to
the async standbys. I personally don't think it is worth taking on the
extra complexity for that level of configuration just yet.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sat, Feb 26, 2022 at 9:37 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Sat, Feb 26, 2022 at 02:17:50PM +0530, Bharath Rupireddy wrote:
A global min LSN of SendRqstPtr of all the sync standbys can be
calculated and the async standbys can send WAL up to global min LSN.
This is unlike what the v1 patch does i.e. async standbys will wait
until the sync standbys report flush LSN back to the primary. Problem
with the global min LSN approach is that there can still be a small
window where async standbys can get ahead of sync standbys. Imagine
async standbys being closer to the primary than sync standbys and if
the failover has to happen while the WAL at SendRqstPtr isn't received
by the sync standbys, but the async standbys can receive them as they
are closer. We hit the same problem that we are trying to solve with
this patch. This is the reason, we are waiting till the sync flush LSN
as it guarantees more transactional protection.
Do you mean that the application of WAL gets ahead on your async standbys
or that the writing/flushing of WAL gets ahead? If synchronous_commit is
set to 'remote_write' or 'on', I think either approach can lead to
situations where the async standbys are ahead of the sync standbys with WAL
application. For example, a conflict between WAL replay and a query on
your sync standby could delay WAL replay, but the primary will not wait for
this conflict to resolve before considering a transaction synchronously
replicated and sending it to the async standbys.
If writing/flushing WAL gets ahead on async standbys, I think something is
wrong with the patch. If you aren't sending WAL to async standbys until
it is synchronously replicated to the sync standbys, it should by
definition be impossible for this to happen.
With the v1 patch [1], the async standbys will never get WAL ahead of
sync standbys. That is guaranteed because the walsenders serving async
standbys are allowed to send WAL only after the walsenders serving
sync standbys receive the synchronous flush LSN.
Do you think allowing async standbys optionally wait for either remote
write or flush or apply or global min LSN of SendRqstPtr so that users
can choose what they want?
I'm not sure I follow the difference between "global min LSN of
SendRqstPtr" and remote write/flush/apply. IIUC you are saying that we
could use the LSN of what is being sent to sync standbys instead of the LSN
of what the primary considers synchronously replicated. I don't think we
should do that because it provides no guarantee that the WAL has even been
sent to the sync standbys before it is sent to the async standbys.
Correct.
For
this feature, I think we always need to consider what the primary considers
synchronously replicated. My suggested approach doesn't change that. I'm
saying that instead of spinning in a loop waiting for the WAL to be
synchronously replicated, we just immediately send WAL up to the LSN that
is presently known to be synchronously replicated.
As I said above, v1 patch does that i.e. async standbys wait until the
sync standbys update their flush LSN.
Flush LSN is this - flushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
which gets updated in SyncRepReleaseWaiters.
Async standbys with their SendRqstPtr will wait in XLogSendPhysical or
XLogSendLogical until SendRqstPtr <= flushLSN.
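A minimal sketch of that v1-style wait, as a fragment (WalSndCtl, SyncRepLock, SYNC_REP_WAIT_FLUSH and WalSndWait are existing symbols; the placement and the latch-based sleep are my assumptions about the patch, not its actual code):
/* Sketch: block the async walsender until the syncrep flush LSN reaches SendRqstPtr. */
for (;;)
{
    XLogRecPtr  flushLSN;

    LWLockAcquire(SyncRepLock, LW_SHARED);
    flushLSN = WalSndCtl->lsn[SYNC_REP_WAIT_FLUSH];
    LWLockRelease(SyncRepLock);

    if (SendRqstPtr <= flushLSN)
        break;                  /* safe to send everything we planned to */

    /* sleep until the walsender latch is set, then re-check the flush LSN */
    WalSndWait(WL_SOCKET_READABLE, 1000 /* ms */, WAIT_EVENT_SYNC_REP);
}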
I will address review comments raised by Hsu, John and send the
updated patch for further review. Thanks.
[1]: /messages/by-id/CALj2ACVUa8WddVDS20QmVKNwTbeOQqy4zy59NPzh8NnLipYZGw@mail.gmail.com
Regards,
Bharath Rupireddy.
On Mon, Feb 28, 2022 at 06:45:51PM +0530, Bharath Rupireddy wrote:
On Sat, Feb 26, 2022 at 9:37 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
For
this feature, I think we always need to consider what the primary considers
synchronously replicated. My suggested approach doesn't change that. I'm
saying that instead of spinning in a loop waiting for the WAL to be
synchronously replicated, we just immediately send WAL up to the LSN that
is presently known to be synchronously replicated.
As I said above, v1 patch does that i.e. async standbys wait until the sync standbys update their flush LSN.
Flush LSN is this - flushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH]; which gets updated in SyncRepReleaseWaiters.
Async standbys with their SendRqstPtr will wait in XLogSendPhysical or XLogSendLogical until SendRqstPtr <= flushLSN.
My feedback is specifically about this behavior. I don't think we should
spin in XLogSend*() waiting for an LSN to be synchronously replicated. I
think we should just choose the SendRqstPtr based on what is currently
synchronously replicated.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
The async walsender looks at flush LSN from
walsndctl->lsn[SYNC_REP_WAIT_FLUSH]; after it comes up and decides to
send the WAL up to it. If there are no sync replicats after it comes
up (users can make sync standbys async without postmaster restart
because synchronous_standby_names is effective with SIGHUP), then it
doesn't wait at all and continues to send WAL. I don't see any problem
with it. Am I missing something here?
Assuming I understand the code correctly, we have:
SendRqstPtr = GetFlushRecPtr(NULL);
In this contrived example let's say walsndctl->lsn[SYNC_REP_WAIT_FLUSH] is
always 60s behind GetFlushRecPtr() and, for whatever reason, if the
walsender hasn't replicated anything in 30s it'll terminate and
re-connect. If GetFlushRecPtr() keeps advancing and is always 60s ahead
of the sync LSNs then we would never stream anything, even though it has
advanced past what was safe to stream previously.
I will correct it. "async standby WAL sender with request LSN %X/%X is
waiting as sync standbys are ahead with flush LSN %X/%X",
LSN_FORMAT_ARGS(sendRqstP), LSN_FORMAT_ARGS(flushLSN). I will think
more about having better wording of these messages, any suggestions here?
"async standby WAL sender with request LSN %X/%X is waiting for sync
standbys at LSN %X/%X to advance past it"
Not sure if that's really clearer...
I too observed this once or twice. It looks like the walsender isn't
detecting postmaster death in for (;;) with WalSndWait. Not sure if
this is expected or true with other wait-loops in walsender code. Any
more thoughts here?
Unfortunately I haven't had a chance to dig into it more although iirc I hit it fairly often.
Thanks,
John H
On Tue, Mar 1, 2022 at 12:27 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Feb 28, 2022 at 06:45:51PM +0530, Bharath Rupireddy wrote:
On Sat, Feb 26, 2022 at 9:37 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
For
this feature, I think we always need to consider what the primary considers
synchronously replicated. My suggested approach doesn't change that. I'm
saying that instead of spinning in a loop waiting for the WAL to be
synchronously replicated, we just immediately send WAL up to the LSN that
is presently known to be synchronously replicated.
As I said above, v1 patch does that i.e. async standbys wait until the sync standbys update their flush LSN.
Flush LSN is this - flushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH]; which gets updated in SyncRepReleaseWaiters.
Async standbys with their SendRqstPtr will wait in XLogSendPhysical or XLogSendLogical until SendRqstPtr <= flushLSN.
My feedback is specifically about this behavior. I don't think we should
spin in XLogSend*() waiting for an LSN to be synchronously replicated. I
think we should just choose the SendRqstPtr based on what is currently
synchronously replicated.
Do you mean something like the following?
/* Main loop of walsender process that streams the WAL over Copy messages. */
static void
WalSndLoop(WalSndSendDataCallback send_data)
{
/*
* Loop until we reach the end of this timeline or the client requests to
* stop streaming.
*/
for (;;)
{
if (am_async_walsender && there_are_sync_standbys)
{
XLogRecPtr SendRqstLSN;
XLogRecPtr SyncFlushLSN;
SendRqstLSN = GetFlushRecPtr(NULL);
LWLockAcquire(SyncRepLock, LW_SHARED);
SyncFlushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
LWLockRelease(SyncRepLock);
if (SendRqstLSN > SyncFlushLSN)
continue;
}
if (!pq_is_send_pending())
send_data(); /* THIS IS WHERE XLogSendPhysical or
XLogSendLogical gets called */
else
WalSndCaughtUp = false;
}
Regards,
Bharath Rupireddy.
On Tue, Mar 01, 2022 at 11:10:09AM +0530, Bharath Rupireddy wrote:
On Tue, Mar 1, 2022 at 12:27 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
My feedback is specifically about this behavior. I don't think we should
spin in XLogSend*() waiting for an LSN to be synchronously replicated. I
think we should just choose the SendRqstPtr based on what is currently
synchronously replicated.
Do you mean something like the following?
/* Main loop of walsender process that streams the WAL over Copy messages. */
static void
WalSndLoop(WalSndSendDataCallback send_data)
{
/*
* Loop until we reach the end of this timeline or the client requests to
* stop streaming.
*/
for (;;)
{
if (am_async_walsender && there_are_sync_standbys)
{
XLogRecPtr SendRqstLSN;
XLogRecPtr SyncFlushLSN;
SendRqstLSN = GetFlushRecPtr(NULL);
LWLockAcquire(SyncRepLock, LW_SHARED);
SyncFlushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
LWLockRelease(SyncRepLock);
if (SendRqstLSN > SyncFlushLSN)
continue;
}
Not quite. Instead of "continue", I would set SendRqstLSN to SyncFlushLSN
so that the WAL sender only sends up to the current synchronously
replicated LSN. TBH there are probably other things that need to be
considered (e.g., how do we ensure that the WAL sender sends the rest once
it is replicated?), but I still think we should avoid spinning in the WAL
sender waiting for WAL to be replicated.
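To make that suggestion concrete, a sketch of the clamp (am_async_walsender and there_are_sync_standbys are placeholders carried over from the pseudocode above, not existing symbols):
/* Sketch: cap SendRqstPtr at what is already synchronously replicated, instead of waiting. */
if (am_async_walsender && there_are_sync_standbys)
{
    XLogRecPtr  SyncFlushLSN;

    LWLockAcquire(SyncRepLock, LW_SHARED);
    SyncFlushLSN = WalSndCtl->lsn[SYNC_REP_WAIT_FLUSH];
    LWLockRelease(SyncRepLock);

    if (SendRqstPtr > SyncFlushLSN)
        SendRqstPtr = SyncFlushLSN;     /* send only the syncrep-confirmed portion */
}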
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
(Now I understand what "async" means here..)
At Mon, 28 Feb 2022 22:05:28 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in
On Tue, Mar 01, 2022 at 11:10:09AM +0530, Bharath Rupireddy wrote:
On Tue, Mar 1, 2022 at 12:27 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
My feedback is specifically about this behavior. I don't think we should
spin in XLogSend*() waiting for an LSN to be synchronously replicated. I
think we should just choose the SendRqstPtr based on what is currently
synchronously replicated.
Do you mean something like the following?
/* Main loop of walsender process that streams the WAL over Copy messages. */
static void
WalSndLoop(WalSndSendDataCallback send_data)
{
/*
* Loop until we reach the end of this timeline or the client requests to
* stop streaming.
*/
for (;;)
{
if (am_async_walsender && there_are_sync_standbys)
{
XLogRecPtr SendRqstLSN;
XLogRecPtr SyncFlushLSN;
SendRqstLSN = GetFlushRecPtr(NULL);
LWLockAcquire(SyncRepLock, LW_SHARED);
SyncFlushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
LWLockRelease(SyncRepLock);
if (SendRqstLSN > SyncFlushLSN)
continue;
}
The current trend is energy-savings. We never add a "wait for some
fixed time, then exit if the condition is met, otherwise repeat" loop
for this kind of purpose where there's no guarantee that the loop
exits quite shortly. Concretely, we ought to rely on condition
variables to do that.
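For reference, a rough sketch of a condition-variable based wait (the ConditionVariable API calls are real; the syncFlushCV field in WalSndCtlData, broadcast whenever the syncrep flush LSN advances, is an assumption made purely for illustration):
static void
WaitForSyncRepFlushLSN(XLogRecPtr target)
{
    ConditionVariablePrepareToSleep(&WalSndCtl->syncFlushCV);   /* assumed field */

    for (;;)
    {
        XLogRecPtr  flushLSN;

        LWLockAcquire(SyncRepLock, LW_SHARED);
        flushLSN = WalSndCtl->lsn[SYNC_REP_WAIT_FLUSH];
        LWLockRelease(SyncRepLock);

        if (target <= flushLSN)
            break;

        /* sleeps on the CV; woken by a broadcast from the syncrep side */
        ConditionVariableSleep(&WalSndCtl->syncFlushCV, WAIT_EVENT_SYNC_REP);
    }

    ConditionVariableCancelSleep();
}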
Not quite. Instead of "continue", I would set SendRqstLSN to SyncFlushLSN
so that the WAL sender only sends up to the current synchronously
I'm not sure, but doesn't that make the walsender falsely believe it
has caught up to the bleeding edge of WAL?
replicated LSN. TBH there are probably other things that need to be
considered (e.g., how do we ensure that the WAL sender sends the rest once
it is replicated?), but I still think we should avoid spinning in the WAL
sender waiting for WAL to be replicated.
It seems to me it would be something similar to
SyncRepReleaseWaiters(). Or it could be possible to consolidate this
feature into the function, I'm not sure, though.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Mar 01, 2022 at 04:34:31PM +0900, Kyotaro Horiguchi wrote:
At Mon, 28 Feb 2022 22:05:28 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in
replicated LSN. TBH there are probably other things that need to be
considered (e.g., how do we ensure that the WAL sender sends the rest once
it is replicated?), but I still think we should avoid spinning in the WAL
sender waiting for WAL to be replicated.
It seems to me it would be something similar to
SyncRepReleaseWaiters(). Or it could be possible to consolidate this
feature into the function, I'm not sure, though.
Yes, perhaps the synchronous replication framework will need to alert WAL
senders when the syncrep LSN advances so that the WAL is sent to the async
standbys. I'm glossing over the details, but I think that should be the
general direction.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Mar 1, 2022 at 10:35 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Mar 01, 2022 at 04:34:31PM +0900, Kyotaro Horiguchi wrote:
At Mon, 28 Feb 2022 22:05:28 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in
replicated LSN. TBH there are probably other things that need to be
considered (e.g., how do we ensure that the WAL sender sends the rest once
it is replicated?), but I still think we should avoid spinning in the WAL
sender waiting for WAL to be replicated.It seems to me it would be something similar to
SyncRepReleaseWaiters(). Or it could be possible to consolidate this
feature into the function, I'm not sure, though.
Yes, perhaps the synchronous replication framework will need to alert WAL
senders when the syncrep LSN advances so that the WAL is sent to the async
standbys. I'm glossing over the details, but I think that should be the
general direction.
It's doable. But we can't avoid async walsenders waiting for the flush
LSN even if we take the SyncRepReleaseWaiters() approach right? I'm
not sure (at this moment) what's the biggest advantage of this
approach i.e. (1) backends waking up walsenders after flush lsn is
updated vs (2) walsenders keep looking for the new flush lsn.
My feedback is specifically about this behavior. I don't think we should
spin in XLogSend*() waiting for an LSN to be synchronously replicated. I
think we should just choose the SendRqstPtr based on what is currently
synchronously replicated.
Do you mean something like the following?
/* Main loop of walsender process that streams the WAL over Copy messages. */
static void
WalSndLoop(WalSndSendDataCallback send_data)
{
/*
* Loop until we reach the end of this timeline or the client requests to
* stop streaming.
*/
for (;;)
{
if (am_async_walsender && there_are_sync_standbys)
{
XLogRecPtr SendRqstLSN;
XLogRecPtr SyncFlushLSN;
SendRqstLSN = GetFlushRecPtr(NULL);
LWLockAcquire(SyncRepLock, LW_SHARED);
SyncFlushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
LWLockRelease(SyncRepLock);
if (SendRqstLSN > SyncFlushLSN)
continue;
}
Not quite. Instead of "continue", I would set SendRqstLSN to SyncFlushLSN
so that the WAL sender only sends up to the current synchronously
replicated LSN. TBH there are probably other things that need to be
considered (e.g., how do we ensure that the WAL sender sends the rest once
it is replicated?), but I still think we should avoid spinning in the WAL
sender waiting for WAL to be replicated.
I did some more analysis on the above point: we can let XLogSendPhysical
know up to where it can send WAL (SendRqstLSN). But XLogSendLogical reads
the WAL using the XLogReadRecord mechanism with the read_local_xlog_page
page_read callback, to which we can't really pass SendRqstLSN. Maybe we
have to do something like below:
XLogSendPhysical:
/* Figure out how far we can safely send the WAL. */
if (am_async_walsender && there_are_sync_standbys)
{
LWLockAcquire(SyncRepLock, LW_SHARED);
SendRqstPtr = WalSndCtl->lsn[SYNC_REP_WAIT_FLUSH];
LWLockRelease(SyncRepLock);
}
/* Existing code path to determine SendRqstPtr */
else if (sendTimeLineIsHistoric)
{
}
else if (am_cascading_walsender)
{
}
else
{
/*
* Streaming the current timeline on a primary.
*/
}
XLogSendLogical:
if (am_async_walsender && there_are_sync_standbys)
{
XLogRecPtr SendRqstLSN;
XLogRecPtr SyncFlushLSN;
SendRqstLSN = GetFlushRecPtr(NULL);
LWLockAcquire(SyncRepLock, LW_SHARED);
SyncFlushLSN = WalSndCtl->lsn[SYNC_REP_WAIT_FLUSH];
LWLockRelease(SyncRepLock);
if (SendRqstLSN > SyncFlushLSN)
return;
}
On Tue, Mar 1, 2022 at 7:35 AM Hsu, John <hsuchen@amazon.com> wrote:
I too observed this once or twice. It looks like the walsender isn't
detecting postmaster death in for (;;) with WalSndWait. Not sure if
this is expected or true with other wait-loops in walsender code. Any
more thoughts here?
Unfortunately I haven't had a chance to dig into it more although iirc I hit it fairly often.
I think I got what the issue is. Below does the trick.
/*
 * If the server is shut down, checkpointer sends us
 * PROCSIG_WALSND_INIT_STOPPING after all regular backends have exited.
 */
if (got_STOPPING)
    proc_exit(0);
I will take care of this in the next patch once the approach we take
for this feature gets finalized.
Regards,
Bharath Rupireddy.
On Tue, Mar 01, 2022 at 11:09:57PM +0530, Bharath Rupireddy wrote:
On Tue, Mar 1, 2022 at 10:35 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
Yes, perhaps the synchronous replication framework will need to alert WAL
senders when the syncrep LSN advances so that the WAL is sent to the async
standbys. I'm glossing over the details, but I think that should be the
general direction.
It's doable. But we can't avoid async walsenders waiting for the flush
LSN even if we take the SyncRepReleaseWaiters() approach right? I'm
not sure (at this moment) what's the biggest advantage of this
approach i.e. (1) backends waking up walsenders after flush lsn is
updated vs (2) walsenders keep looking for the new flush lsn.
I think there are a couple of advantages. For one, spinning is probably
not the best from a resource perspective. There is no guarantee that the
desired SendRqstPtr will ever be synchronously replicated, in which case
the WAL sender would spin forever. Also, this approach might fit in better
with the existing synchronous replication framework. When a WAL sender
realizes that it can't send up to the current "flush" LSN because it's not
synchronously replicated, it will request to be alerted when it is. In the
meantime, it can send up to the latest syncrep LSN so that the async
standby is as up-to-date as possible.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 2, 2022 at 2:57 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Mar 01, 2022 at 11:09:57PM +0530, Bharath Rupireddy wrote:
On Tue, Mar 1, 2022 at 10:35 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
Yes, perhaps the synchronous replication framework will need to alert WAL
senders when the syncrep LSN advances so that the WAL is sent to the async
standbys. I'm glossing over the details, but I think that should be the
general direction.It's doable. But we can't avoid async walsenders waiting for the flush
LSN even if we take the SyncRepReleaseWaiters() approach right? I'm
not sure (at this moment) what's the biggest advantage of this
approach i.e. (1) backends waking up walsenders after flush lsn is
updated vs (2) walsenders keep looking for the new flush lsn.I think there are a couple of advantages. For one, spinning is probably
not the best from a resource perspective.
Just to be on the same page - by spinning do you mean - the async
walsender waiting for the sync flushLSN in a for-loop with
WaitLatch()?
There is no guarantee that the
desired SendRqstPtr will ever be synchronously replicated, in which case
the WAL sender would spin forever.
The async walsenders will not exactly wait for SendRqstPtr LSN to be
the flush lsn. Say, SendRqstPtr is 100 and the current sync FlushLSN
is 95, they will have to wait until FlushLSN moves ahead of
SendRqstPtr i.e. SendRqstPtr <= FlushLSN. I can't think of a scenario
(right now) that doesn't move the sync FlushLSN at all. If there's
such a scenario, shouldn't it be treated as a sync replication bug?
Also, this approach might fit in better
with the existing synchronous replication framework. When a WAL sender
realizes that it can't send up to the current "flush" LSN because it's not
synchronously replicated, it will request to be alerted when it is.
I think you are referring to the way a backend calls SyncRepWaitForLSN
and waits until any one of the walsenders sets syncRepState to
SYNC_REP_WAIT_COMPLETE in SyncRepWakeQueue. Firstly, SyncRepWaitForLSN
is blocking, i.e. the backend spins/waits in a for (;;) loop until its
syncRepState becomes SYNC_REP_WAIT_COMPLETE. The backend doesn't do
any other work but waits. So, spinning isn't avoided completely.
Unless I'm missing something, the existing sync rep queue
(SyncRepQueue) mechanism doesn't avoid spinning in the requestors
(backends) SyncRepWaitForLSN or in the walsenders SyncRepWakeQueue.
In the
meantime, it can send up to the latest syncrep LSN so that the async
standby is as up-to-date as possible.
Just to be clear, there can exist the following scenarios:
Firstly, SendRqstPtr is the LSN up to which a walsender can send WAL; it's not the sync FlushLSN.
scenario 1:
async SendRqstPtr is 100, sync FlushLSN is 95 - async standbys will
wait until the FlushLSN moves ahead, once SendRqstPtr <= FlushLSN, it
sends out the WAL.
scenario 2:
async SendRqstPtr is 105, sync FlushLSN is 110 - async standbys will
not wait, it just sends out the WAL up to SendRqstPtr i.e. LSN 105.
scenario 3, same as scenario 2 but SendRqstPtr and FlushLSN are the same:
async SendRqstPtr is 105, sync FlushLSN is 105 - async standbys will
not wait, it just sends out the WAL up to SendRqstPtr i.e. LSN 105.
This way, the async standbys are always as up-to-date as possible with
the sync FlushLSN.
Are you referring to any other scenarios?
Regards,
Bharath Rupireddy.
On Wed, Mar 02, 2022 at 09:47:09AM +0530, Bharath Rupireddy wrote:
On Wed, Mar 2, 2022 at 2:57 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
I think there are a couple of advantages. For one, spinning is probably
not the best from a resource perspective.
Just to be on the same page - by spinning do you mean - the async
walsender waiting for the sync flushLSN in a for-loop with
WaitLatch()?
Yes.
Also, this approach might fit in better
with the existing synchronous replication framework. When a WAL sender
realizes that it can't send up to the current "flush" LSN because it's not
synchronously replicated, it will request to be alerted when it is.
I think you are referring to the way a backend calls SyncRepWaitForLSN
and waits until any one of the walsender sets syncRepState to
SYNC_REP_WAIT_COMPLETE in SyncRepWakeQueue. Firstly, SyncRepWaitForLSN
blocking i.e. the backend spins/waits in for (;;) loop until its
syncRepState becomes SYNC_REP_WAIT_COMPLETE. The backend doesn't do
any other work but waits. So, spinning isn't avoided completely.
Unless I'm missing something, the existing sync rep queue
(SyncRepQueue) mechanism doesn't avoid spinning in the requestors
(backends) SyncRepWaitForLSN or in the walsenders SyncRepWakeQueue.
My point is that there are existing tools for alerting processes when an
LSN is synchronously replicated and for waking up WAL senders. What I am
proposing wouldn't involve spinning in XLogSendPhysical() waiting for
synchronous replication. Like SyncRepWaitForLSN(), we'd register our LSN
in the queue (SyncRepQueueInsert()), but we wouldn't sit in a separate loop
waiting to be woken. Instead, SyncRepWakeQueue() would eventually wake up
the WAL sender and trigger another iteration of WalSndLoop().
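Roughly, a sketch of that registration step (the function name below is hypothetical; SyncRepQueueInsert() is normally static in syncrep.c, so exposing it, as the patch at the end of this thread does, is assumed here):
/* Sketch: register our target LSN and return; SyncRepWakeQueue() sets our latch later. */
static void
AsyncWalSndRegisterSyncRepWait(XLogRecPtr lsn)
{
    LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
    MyProc->waitLSN = lsn;
    MyProc->syncRepState = SYNC_REP_WAITING;
    SyncRepQueueInsert(SYNC_REP_WAIT_FLUSH);
    LWLockRelease(SyncRepLock);

    /* no wait loop here: the latch wake-up drives the next WalSndLoop() iteration */
}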
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sat, Mar 5, 2022 at 1:26 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Wed, Mar 02, 2022 at 09:47:09AM +0530, Bharath Rupireddy wrote:
On Wed, Mar 2, 2022 at 2:57 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
I think there are a couple of advantages. For one, spinning is probably
not the best from a resource perspective.Just to be on the same page - by spinning do you mean - the async
walsender waiting for the sync flushLSN in a for-loop with
WaitLatch()?
Yes.
Also, this approach might fit in better
with the existing synchronous replication framework. When a WAL sender
realizes that it can't send up to the current "flush" LSN because it's not
synchronously replicated, it will request to be alerted when it is.
I think you are referring to the way a backend calls SyncRepWaitForLSN
and waits until any one of the walsender sets syncRepState to
SYNC_REP_WAIT_COMPLETE in SyncRepWakeQueue. Firstly, SyncRepWaitForLSN
blocking i.e. the backend spins/waits in for (;;) loop until its
syncRepState becomes SYNC_REP_WAIT_COMPLETE. The backend doesn't do
any other work but waits. So, spinning isn't avoided completely.
Unless I'm missing something, the existing sync rep
(SyncRepQueue) mechanism doesn't avoid spinning in the requestors
(backends) SyncRepWaitForLSN or in the walsenders SyncRepWakeQueue.
My point is that there are existing tools for alerting processes when an
LSN is synchronously replicated and for waking up WAL senders. What I am
proposing wouldn't involve spinning in XLogSendPhysical() waiting for
synchronous replication. Like SyncRepWaitForLSN(), we'd register our LSN
in the queue (SyncRepQueueInsert()), but we wouldn't sit in a separate loop
waiting to be woken. Instead, SyncRepWakeQueue() would eventually wake up
the WAL sender and trigger another iteration of WalSndLoop().
I understand. Even if we use the SyncRepWaitForLSN approach, the async
walsenders will have to do nothing in WalSndLoop() until the sync
walsender wakes them up via SyncRepWakeQueue. For sure, the
SyncRepWaitForLSN approach avoids extra looping and makes the code
look better. One concern is the increased burden on SyncRepLock that the
SyncRepWaitForLSN approach will need to take
(LWLockAcquire(SyncRepLock, LW_EXCLUSIVE)), now that the async
walsenders will get added to the list of backends that contend for
SyncRepLock. Whereas the other approach that I earlier proposed would
only require SyncRepLock in shared mode, as it just needs to read the flushLSN.
I'm not sure if it's a bigger problem.
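To spell out the two locking patterns being compared (both already appear elsewhere in this thread and in the v2 patch at the end of it):
/* Polling approach: only a shared lock to read the flush LSN. */
LWLockAcquire(SyncRepLock, LW_SHARED);
flushLSN = WalSndCtl->lsn[SYNC_REP_WAIT_FLUSH];
LWLockRelease(SyncRepLock);

/* SyncRepWaitForLSN-style approach: queue registration needs the exclusive lock. */
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
SyncRepQueueInsert(SYNC_REP_WAIT_FLUSH);
LWLockRelease(SyncRepLock);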
Having said above, I agree that the SyncRepWaitForLSN approach makes
things probably easy and avoids the new wait loops.
Let me think more and work on this approach.
Regards,
Bharath Rupireddy.
Hi,
On 2022-03-05 14:14:54 +0530, Bharath Rupireddy wrote:
I understand. Even if we use the SyncRepWaitForLSN approach, the async
walsenders will have to do nothing in WalSndLoop() until the sync
walsender wakes them up via SyncRepWakeQueue.
I still think we should flat out reject this approach. The proper way to
implement this feature is to change the protocol so that WAL can be sent to
replicas with an additional LSN informing them up to where WAL can be
flushed. That way WAL is already sent when the sync replicas have acknowledged
receipt and just an updated "flush/apply up to here" LSN has to be sent.
- Andres
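To visualize the idea (this message layout is purely illustrative; it is not an existing replication protocol message, and the field names are invented):
/*
 * Illustrative only: a WAL-data header extended with the LSN up to which
 * the receiver may flush/apply. WAL beyond safeFlushLSN would be held back
 * until a later message advances that LSN.
 */
typedef struct HypotheticalWalDataHeader
{
    XLogRecPtr  dataStart;      /* start LSN of the WAL payload */
    XLogRecPtr  walEnd;         /* end LSN of the WAL payload */
    XLogRecPtr  safeFlushLSN;   /* highest LSN known to be synchronously replicated */
    TimestampTz sendTime;       /* timestamp when the message was sent */
} HypotheticalWalDataHeader;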
On Sun, Mar 6, 2022 at 1:57 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-03-05 14:14:54 +0530, Bharath Rupireddy wrote:
I understand. Even if we use the SyncRepWaitForLSN approach, the async
walsenders will have to do nothing in WalSndLoop() until the sync
walsender wakes them up via SyncRepWakeQueue.
I still think we should flat out reject this approach. The proper way to
implement this feature is to change the protocol so that WAL can be sent to
replicas with an additional LSN informing them up to where WAL can be
flushed. That way WAL is already sent when the sync replicas have acknowledged
receipt and just an updated "flush/apply up to here" LSN has to be sent.
I was having this thought at the back of my mind. Please help me understand these:
1) How will the async standbys ignore the WAL received but
not-yet-flushed by them in case the sync standbys don't acknowledge
flush LSN back to the primary for whatever reasons?
2) When we say the async standbys will receive the WAL, will they just
keep the received WAL in the shared memory but not apply or will they
just write but not apply the WAL and flush the WAL to the pg_wal
directory on the disk or will they write to some other temp wal
directory until they receive go-ahead LSN from the primary?
3) Won't the network transfer cost be wasted in case the sync standbys
don't acknowledge flush LSN back to the primary for whatever reasons?
The proposed idea in this thread (async standbys waiting for flush LSN
from sync standbys before sending the WAL), although it makes async
standby slower in receiving the WAL, it doesn't have the above
problems and is simpler to implement IMO. Since this feature is going
to be optional with a GUC, users can enable it based on the needs.
Regards,
Bharath Rupireddy.
Hi,
On 3/5/22 10:57 PM, Bharath Rupireddy wrote:
On Sun, Mar 6, 2022 at 1:57 AM Andres Freund<andres@anarazel.de> wrote:
Hi,
On 2022-03-05 14:14:54 +0530, Bharath Rupireddy wrote:
I understand. Even if we use the SyncRepWaitForLSN approach, the async
walsenders will have to do nothing in WalSndLoop() until the sync
walsender wakes them up via SyncRepWakeQueue.
I still think we should flat out reject this approach. The proper way to
implement this feature is to change the protocol so that WAL can be sent to
replicas with an additional LSN informing them up to where WAL can be
flushed. That way WAL is already sent when the sync replicas have acknowledged
receipt and just an updated "flush/apply up to here" LSN has to be sent.
I was having this thought at the back of my mind. Please help me understand these:
1) How will the async standbys ignore the WAL received but
not-yet-flushed by them in case the sync standbys don't acknowledge
flush LSN back to the primary for whatever reasons?
2) When we say the async standbys will receive the WAL, will they just
keep the received WAL in the shared memory but not apply or will they
just write but not apply the WAL and flush the WAL to the pg_wal
directory on the disk or will they write to some other temp wal
directory until they receive go-ahead LSN from the primary?
3) Won't the network transfer cost be wasted in case the sync standbys
don't acknowledge flush LSN back to the primary for whatever reasons?
The proposed idea in this thread (async standbys waiting for flush LSN
from sync standbys before sending the WAL), although it makes async
standby slower in receiving the WAL, it doesn't have the above
problems and is simpler to implement IMO. Since this feature is going
to be optional with a GUC, users can enable it based on the needs.
I think another downside of the approach would be that if the async replica
had a lot of changes that were unacknowledged and it were to be
restarted for whatever reason, we might need to recreate the replica, or
run pg_rewind on it again, which seems to be what we're trying to avoid.
It also pushes the complexity to the client side for consumers who stream
changes from logical slots, which the current proposal seems to prevent.
Hi,
On 2022-03-06 12:27:52 +0530, Bharath Rupireddy wrote:
On Sun, Mar 6, 2022 at 1:57 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-03-05 14:14:54 +0530, Bharath Rupireddy wrote:
I understand. Even if we use the SyncRepWaitForLSN approach, the async
walsenders will have to do nothing in WalSndLoop() until the sync
walsender wakes them up via SyncRepWakeQueue.
I still think we should flat out reject this approach. The proper way to
implement this feature is to change the protocol so that WAL can be sent to
replicas with an additional LSN informing them up to where WAL can be
flushed. That way WAL is already sent when the sync replicas have acknowledged
receipt and just an updated "flush/apply up to here" LSN has to be sent.
I was having this thought at the back of my mind. Please help me understand these:
1) How will the async standbys ignore the WAL received but
not-yet-flushed by them in case the sync standbys don't acknowledge
flush LSN back to the primary for whatever reasons?
What do you mean with "ignore"? When replaying?
I think this'd require adding a new pg_control field saying up to which LSN
WAL is "valid". If that field is set, replay would only replay up to that LSN
unless some explicit operation is taken to replay further (e.g. for data
recovery).
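Roughly, and purely as an illustration (the field and function below are invented for this sketch; nothing like them exists today):
/* Hypothetical limit stored in pg_control on an async standby. */
XLogRecPtr  maxValidWalLSN;     /* InvalidXLogRecPtr means "no limit" */

/* Hypothetical recovery-side check before replaying a record ending at EndRecPtr. */
static bool
RecordIsSafeToReplay(XLogRecPtr EndRecPtr, XLogRecPtr maxValidWalLSN)
{
    return XLogRecPtrIsInvalid(maxValidWalLSN) || EndRecPtr <= maxValidWalLSN;
}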
2) When we say the async standbys will receive the WAL, will they just
keep the received WAL in the shared memory but not apply or will they
just write but not apply the WAL and flush the WAL to the pg_wal
directory on the disk or will they write to some other temp wal
directory until they receive go-ahead LSN from the primary?
I was thinking that for now it'd go to disk, but eventually would first go to
wal_buffers and only to disk if wal_buffers needs to be flushed out (and only
in that case the pg_control field would need to be set).
3) Won't the network transfer cost be wasted in case the sync standbys
don't acknowledge flush LSN back to the primary for whatever reasons?
That should be *extremely* rare, and in that case a bit of wasted traffic
isn't going to matter.
The proposed idea in this thread (async standbys waiting for flush LSN
from sync standbys before sending the WAL), although it makes async
standby slower in receiving the WAL, it doesn't have the above
problems and is simpler to implement IMO. Since this feature is going
to be optional with a GUC, users can enable it based on the needs.
To me it's architecturally the completely wrong direction. We should move in
the *other* direction, i.e. allow WAL to be sent to standbys before the
primary has finished flushing it locally. Which requires similar
infrastructure to what we're discussing here.
Greetings,
Andres Freund
On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote:
To me it's architecturally the completely wrong direction. We should move in
the *other* direction, i.e. allow WAL to be sent to standbys before the
primary has finished flushing it locally. Which requires similar
infrastructure to what we're discussing here.
I think this is a good point. After all, WALRead() has the following
comment:
* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
Once you have all the infrastructure for that, holding back WAL replay on
async standbys based on synchronous replication might be relatively easy.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
At Sat, 12 Mar 2022 14:33:32 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in
On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote:
To me it's architecturally the completely wrong direction. We should move in
the *other* direction, i.e. allow WAL to be sent to standbys before the
primary has finished flushing it locally. Which requires similar
infrastructure to what we're discussing here.
I think this is a good point. After all, WALRead() has the following comment:
* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
Once you have all the infrastructure for that, holding back WAL replay on
async standbys based on synchronous replication might be relatively easy.
That is, (as my understanding) async standbys are required to allow
overwriting existing unreplayed records after reconnection. But,
putting aside how to remember that LSN, if that happens at a segment
boundary, the async replica may run into the similar situation with
the missing-contrecord case. But standby cannot insert any original
record to get out from that situation.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 14 Mar 2022 11:30:02 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Sat, 12 Mar 2022 14:33:32 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in
On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote:
To me it's architecturally the completely wrong direction. We should move in
the *other* direction, i.e. allow WAL to be sent to standbys before the
primary has finished flushing it locally. Which requires similar
infrastructure to what we're discussing here.
I think this is a good point. After all, WALRead() has the following comment:
* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
Once you have all the infrastructure for that, holding back WAL replay on
async standbys based on synchronous replication might be relatively easy.
Just to make sure, and a bit off the topic, I think the
optimization itself is quite promising and I want to have it.
That is, (as my understanding) async standbys are required to allow
overwriting existing unreplayed records after reconnection. But,
putting aside how to remember that LSN, if that happens at a segment
boundary, the async replica may run into the similar situation with
the missing-contrecord case. But standby cannot insert any original
record to get out from that situation.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2022-03-14 11:30:02 +0900, Kyotaro Horiguchi wrote:
That is, (as my understanding) async standbys are required to allow
overwriting existing unreplayed records after reconnection. But,
putting aside how to remember that LSN, if that happens at a segment
boundary, the async replica may run into the similar situation with
the missing-contrecord case. But standby cannot insert any original
record to get out from that situation.
I do not see how that problem arises on standbys when they aren't allowed to
read those records. They'll just wait for more data to arrive.
Greetings,
Andres Freund
On Wed, Mar 9, 2022 at 7:31 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-03-06 12:27:52 +0530, Bharath Rupireddy wrote:
On Sun, Mar 6, 2022 at 1:57 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-03-05 14:14:54 +0530, Bharath Rupireddy wrote:
I understand. Even if we use the SyncRepWaitForLSN approach, the async
walsenders will have to do nothing in WalSndLoop() until the sync
walsender wakes them up via SyncRepWakeQueue.
I still think we should flat out reject this approach. The proper way to
implement this feature is to change the protocol so that WAL can be sent to
replicas with an additional LSN informing them up to where WAL can be
flushed. That way WAL is already sent when the sync replicas have acknowledged
receipt and just an updated "flush/apply up to here" LSN has to be sent.
I was having this thought at the back of my mind. Please help me understand these:
1) How will the async standbys ignore the WAL received but
not-yet-flushed by them in case the sync standbys don't acknowledge
flush LSN back to the primary for whatever reasons?
What do you mean with "ignore"? When replaying?
Let me illustrate with an example:
1) Say, primary at LSN 100, sync standby at LSN 90 (is about to
receive/receiving the WAL from LSN 91 - 100 from primary), async
standby at LSN 100 - today this is possible if the async standby is
closer to primary than sync standby for whatever reasons
2) With the approach that's originally proposed in this thread - async
standbys can never get ahead of LSN 90 (flush LSN reported back to the
primary by all sync standbys)
3) With the approach that's suggested i.e. "let async standbys receive
WAL at their own pace, but they should only be allowed to
apply/write/flush to the WAL file in pg_wal directory/disk until the
sync standbys latest flush LSN" - async standbys can receive the WAL
from LSN 91 - 100 but they aren't allowed to apply/write/flush. Where
will the async standbys hold the WAL from LSN 91 - 100 until the
latest flush LSN (100) is reported to them? If they "somehow" store
the WAL from LSN 91 - 100 and not apply/write/flush, how will they
ignore that WAL, say if the sync standbys don't report the latest
flush LSN back to the primary (for whatever reasons)? In such cases,
the primary has no idea of the latest sync standbys flush LSN (?) if
at all the sync standbys can't come up and reconnect and resync with
the primary? Should the async standby always assume that the WAL from
LSN 91 -100 is invalid for them as they haven't received the sync
flush LSN from primary? In such a case, aren't there "invalid holes"
in the WAL files on the async standbys?
I think this'd require adding a new pg_control field saying up to which LSN
WAL is "valid". If that field is set, replay would only replay up to that LSN
unless some explicit operation is taken to replay further (e.g. for data
recovery).
With the approach that's suggested i.e. "let async standbys receive
WAL at their own pace, but they should only be allowed to
apply/write/flush to the WAL file in pg_wal directory/disk until the
sync standbys' latest flush LSN" - there can be 2 parts to the WAL on
async standbys - most of it "valid and makes sense for async standbys"
and some of it "invalid and doesn't make sense for async standbys"?
Can't this require us to rework some parts like "redo/apply/recovery
logic on async standbys", tools like pg_basebackup, pg_rewind,
pg_receivewal, pg_recvlogical, cascading replication etc. that depend
on WAL records and now should know whether the WAL records are valid
for them? I may be wrong here though.
2) When we say the async standbys will receive the WAL, will they just
keep the received WAL in the shared memory but not apply or will they
just write but not apply the WAL and flush the WAL to the pg_wal
directory on the disk or will they write to some other temp wal
directory until they receive go-ahead LSN from the primary?
I was thinking that for now it'd go to disk, but eventually would first go to
wal_buffers and only to disk if wal_buffers needs to be flushed out (and only
in that case the pg_control field would need to be set).
IIUC, the WAL buffers (XLogCtl->pages) aren't used on standbys as wal
receivers bypass them and flush the data directly to the disk. Hence,
the WAL buffers that are allocated(?, I haven't checked the code
though) but unused on standbys can be used to hold the WAL until the
new flush LSN is reported from the primary. At any point of time, the
WAL buffers will have the latest WAL that's waiting for a new flush
LSN from the primary. However, this can be a problem for larger
transactions that can eat up the entire WAL buffers and flush LSN is
far behind in which case we need to flush the WAL to the latest WAL
file in pg_wal/disk but let the other folks in the server know up to
which the WAL is valid.
3) Won't the network transfer cost be wasted in case the sync standbys
don't acknowledge flush LSN back to the primary for whatever reasons?
That should be *extremely* rare, and in that case a bit of wasted traffic
isn't going to matter.
Agree.
The proposed idea in this thread (async standbys waiting for flush LSN
from sync standbys before sending the WAL), although it makes async
standby slower in receiving the WAL, it doesn't have the above
problems and is simpler to implement IMO. Since this feature is going
to be optional with a GUC, users can enable it based on the needs.
To me it's architecturally the completely wrong direction. We should move in
the *other* direction, i.e. allow WAL to be sent to standbys before the
primary has finished flushing it locally. Which requires similar
infrastructure to what we're discussing here.
Agree.
* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
Like others pointed out, if done above, it's possible to achieve
"allow WAL to be sent to standbys before the primary has finished
flushing it locally".
I would like to hear more thoughts and then summarize the design
points a bit later.
Regards,
Bharath Rupireddy.
On Sat, Mar 5, 2022 at 1:26 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
My point is that there are existing tools for alerting processes when an
LSN is synchronously replicated and for waking up WAL senders. What I am
proposing wouldn't involve spinning in XLogSendPhysical() waiting for
synchronous replication. Like SyncRepWaitForLSN(), we'd register our LSN
in the queue (SyncRepQueueInsert()), but we wouldn't sit in a separate loop
waiting to be woken. Instead, SyncRepWakeQueue() would eventually wake up
the WAL sender and trigger another iteration of WalSndLoop().
While we continue to discuss the other, better design at [1], FWIW, I
would like to share a simpler patch that lets wal senders serving
async standbys wait until sync standbys report the flush lsn.
Obviously this is not an elegant way to solve the problem reported in
this thread, but as I have had this patch ready for long, I wanted to share
it here.
Nathan, of course, this is not something you wanted.
[1]: /messages/by-id/CALj2ACWCj60g6TzYMbEO07ZhnBGbdCveCrD413udqbRM0O59RA@mail.gmail.com
Regards,
Bharath Rupireddy.
Attachments:
v2-0001-Allow-async-standbys-wait-for-sync-replication.patch (application/x-patch)
From 8a74dbd8c3fe3631b6a2ecee3d8c7a1c98429341 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 17 Mar 2022 17:01:53 +0000
Subject: [PATCH v2] Allow async standbys wait for sync replication
---
doc/src/sgml/config.sgml | 22 +++
src/backend/replication/syncrep.c | 11 +-
src/backend/replication/walsender.c | 157 ++++++++++++++++++
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 3 +
src/include/replication/syncrep.h | 7 +
src/include/replication/walsender.h | 1 +
7 files changed, 202 insertions(+), 8 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7a48973b3c..026e12f5ef 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4247,6 +4247,28 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</listitem>
</varlistentry>
+ <varlistentry id="guc-async-standbys-wait-for-sync_replication" xreflabel="async_standbys_wait_for_sync_replication">
+ <term><varname>async_standbys_wait_for_sync_replication</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>async_standbys_wait_for_sync_replication</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When set, asynchronous standbys will wait until synchronous standbys
+ (that are either in quorum or priority based) receive WAL, flush and
+ acknowledge primary with the flush LSN. This behaviour is particularly
+ important in the event of failover as it avoids manual steps required
+ on the asynchronous standbys that might go ahead of the synchronous
+ standbys.
+ </para>
+ <para>
+ This parameter can only be set in the <filename>postgresql.conf</filename>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ce163b99e9..398c11579c 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -97,8 +97,6 @@ static bool announce_next_takeover = true;
SyncRepConfigData *SyncRepConfig = NULL;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
-static void SyncRepQueueInsert(int mode);
-static void SyncRepCancelWait(void);
static int SyncRepWakeQueue(bool all, int mode);
static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
@@ -120,9 +118,6 @@ static int SyncRepGetStandbyPriority(void);
static int standby_priority_comparator(const void *a, const void *b);
static int cmp_lsn(const void *a, const void *b);
-#ifdef USE_ASSERT_CHECKING
-static bool SyncRepQueueIsOrderedByLSN(int mode);
-#endif
/*
* ===========================================================
@@ -335,7 +330,7 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
* Usually we will go at tail of queue, though it's possible that we arrive
* here out of order, so start at tail and work back to insertion point.
*/
-static void
+void
SyncRepQueueInsert(int mode)
{
PGPROC *proc;
@@ -368,7 +363,7 @@ SyncRepQueueInsert(int mode)
/*
* Acquire SyncRepLock and cancel any wait currently in progress.
*/
-static void
+void
SyncRepCancelWait(void)
{
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
@@ -979,7 +974,7 @@ SyncRepUpdateSyncStandbysDefined(void)
}
#ifdef USE_ASSERT_CHECKING
-static bool
+bool
SyncRepQueueIsOrderedByLSN(int mode)
{
PGPROC *proc = NULL;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2d0292a092..97d32a6294 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -125,6 +125,8 @@ int wal_sender_timeout = 60 * 1000; /* maximum time to send one WAL
* data message */
bool log_replication_commands = false;
+bool async_standbys_wait_for_sync_replication = true;
+
/*
* State for WalSndWakeupRequest
*/
@@ -258,6 +260,8 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
TimeLineID *tli_p);
+/* called by wal sender serving asynchronous standby */
+static void AsynWalSndWaitForSyncRepLSN(XLogRecPtr lsn);
/* Initialize walsender process before entering the main command loop */
void
@@ -2771,6 +2775,8 @@ XLogSendPhysical(void)
SendRqstPtr = GetFlushRecPtr(NULL);
}
+ AsynWalSndWaitForSyncRepLSN(SendRqstPtr);
+
/*
* Record the current system time as an approximation of the time at which
* this WAL location was written for the purposes of lag tracking.
@@ -2995,6 +3001,22 @@ XLogSendLogical(void)
if (record != NULL)
{
+ /*
+ * At this point, we do not know whether the current LSN (ReadRecPtr)
+ * is required by any of the logical decoding output plugins which is
+ * only known at the plugin level. If we were to decide whether to wait
+ * or not for the synchronous standbys flush LSN at the plugin level,
+ * we might have to pass extra information to it which doesn't sound an
+ * elegant way.
+ *
+ * Another way the output plugins can wait there before sending the WAL
+ * is by reading the flush LSN from the logical replication slots.
+ *
+ * Waiting here i.e. before even the logical decoding kicks in, makes
+ * the code clean.
+ */
+ AsynWalSndWaitForSyncRepLSN(logical_decoding_ctx->reader->ReadRecPtr);
+
/*
* Note the lack of any call to LagTrackerWrite() which is handled by
* WalSndUpdateProgress which is called by output plugin through
@@ -3789,3 +3811,138 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
Assert(time != 0);
return now - time;
}
+
+/*
+ * This function is similar to SyncRepWaitForLSN() in syncrep.c. Only
+ * difference is in the way it waits with WalSndWait and exits when
+ * got_STOPPING or got_SIGUSR2 is set.
+ */
+void
+AsynWalSndWaitForSyncRepLSN(XLogRecPtr lsn)
+{
+ int mode;
+
+ /*
+ * Fast exit in case we are told to not wait.
+ */
+ if (!async_standbys_wait_for_sync_replication)
+ return;
+
+ /*
+ * Fast exit in case the wal sender is serving synchronous standby at the
+ * moment as it has no business here.
+ */
+ if (MyWalSnd->sync_standby_priority > 0)
+ return;
+
+ /*
+ * Fast exit if user has not requested sync replication, or there are no
+ * sync replication standby names defined.
+ *
+ * Since this routine gets called every commit time, it's important to
+ * exit quickly if sync replication is not requested. So we check
+ * WalSndCtl->sync_standbys_defined flag without the lock and exit
+ * immediately if it's false. If it's true, we need to check it again
+ * later while holding the lock, to check the flag and operate the sync
+ * rep queue atomically. This is necessary to avoid the race condition
+ * described in SyncRepUpdateSyncStandbysDefined(). On the other hand, if
+ * it's false, the lock is not necessary because we don't touch the queue.
+ */
+ if (!SyncRepRequested() ||
+ !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+ return;
+
+ mode = SYNC_REP_WAIT_FLUSH;
+
+ Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+ Assert(WalSndCtl != NULL);
+
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
+
+ /*
+ * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
+ * set. See SyncRepUpdateSyncStandbysDefined.
+ *
+ * Also check that the standby hasn't already replied. Unlikely race
+ * condition but we'll be fetching that cache line anyway so it's likely
+ * to be a low cost check.
+ */
+ if (!WalSndCtl->sync_standbys_defined ||
+ lsn <= WalSndCtl->lsn[mode])
+ {
+ LWLockRelease(SyncRepLock);
+ return;
+ }
+
+ /*
+ * Set our waitLSN so WALSender will know when to wake us, and add
+ * ourselves to the queue.
+ */
+ MyProc->waitLSN = lsn;
+ MyProc->syncRepState = SYNC_REP_WAITING;
+ SyncRepQueueInsert(mode);
+ Assert(SyncRepQueueIsOrderedByLSN(mode));
+ LWLockRelease(SyncRepLock);
+
+ /*
+ * Wait for specified LSN to be confirmed.
+ *
+ * Each proc has its own wait latch, so we perform a normal latch
+ * check/wait loop here.
+ */
+ for (;;)
+ {
+ /* Must reset the latch before testing state. */
+ ResetLatch(MyLatch);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Process any requests or signals received recently */
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ SyncRepInitConfig();
+ }
+
+ /*
+ * The GUC was turned off or this walsender has become a synchronous
+ * one; remove ourselves from the sync rep queue before bailing out.
+ */
+ if (!async_standbys_wait_for_sync_replication ||
+ MyWalSnd->sync_standby_priority > 0)
+ {
+ SyncRepCancelWait();
+ break;
+ }
+
+ /*
+ * Acquiring the lock is not needed, the latch ensures proper
+ * barriers. If it looks like we're done, we must really be done,
+ * because once walsender changes the state to SYNC_REP_WAIT_COMPLETE,
+ * it will never update it again, so we can't be seeing a stale value
+ * in that case.
+ */
+ if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+ break;
+
+ /* Sleep until something happens or we time out */
+ WalSndWait(WL_SOCKET_WRITEABLE | WL_SOCKET_READABLE, -1,
+ WAIT_EVENT_SYNC_REP);
+
+ if (got_STOPPING || got_SIGUSR2)
+ {
+ SyncRepCancelWait();
+ break;
+ }
+ }
+
+ /*
+ * WalSender has checked our LSN and has removed us from queue. Clean up
+ * state and leave. It's OK to reset these shared memory fields without
+ * holding SyncRepLock, because any walsenders will ignore us anyway when
+ * we're not on the queue. We need a read barrier to make sure we see the
+ * changes to the queue link (this might be unnecessary without
+ * assertions, but better safe than sorry).
+ */
+ pg_read_barrier();
+ Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+ MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+ MyProc->waitLSN = 0;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e7f0a380e6..cd5f8b5a42 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1388,6 +1388,15 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+ {
+ {"async_standbys_wait_for_sync_replication", PGC_SIGHUP, REPLICATION_SENDING,
+ gettext_noop("Sets whether asynchronous standbys should wait until synchronous standbys receive and flush WAL."),
+ NULL
+ },
+ &async_standbys_wait_for_sync_replication,
+ true,
+ NULL, NULL, NULL
+ },
{
{"debug_assertions", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows whether the running server has assertion checks enabled."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4cf5b26a36..9e6dc5082b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -308,6 +308,9 @@
#wal_sender_timeout = 60s # in milliseconds; 0 disables
#track_commit_timestamp = off # collect timestamp of transaction commit
# (change requires restart)
+#async_standbys_wait_for_sync_replication = on # Specifies whether asynchronous
+ # standbys should wait until synchronous standbys receive and
+ # flush WAL
# - Primary Server -
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 27be230d77..b433528174 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -90,6 +90,13 @@ extern void SyncRepCleanupAtProcExit(void);
/* called by wal sender */
extern void SyncRepInitConfig(void);
extern void SyncRepReleaseWaiters(void);
+extern void SyncRepQueueInsert(int mode);
+extern void SyncRepCancelWait(void);
+
+#ifdef USE_ASSERT_CHECKING
+extern bool SyncRepQueueIsOrderedByLSN(int mode);
+#endif
+
/* called by wal sender and user backend */
extern int SyncRepGetCandidateStandbys(SyncRepStandbyData **standbys);
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index b1892e9e4b..61b113ae48 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -34,6 +34,7 @@ extern bool wake_wal_senders;
extern int max_wal_senders;
extern int wal_sender_timeout;
extern bool log_replication_commands;
+extern bool async_standbys_wait_for_sync_replication;
extern void InitWalSender(void);
extern bool exec_replication_command(const char *query_string);
--
2.25.1
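
To try the patch out, a minimal primary-side configuration could look like
the one below; the standby name s1 and the FIRST 1 quorum are placeholders
for illustration, and the second line only spells out the default the patch
introduces:

    synchronous_standby_names = 'FIRST 1 (s1)'      # s1 is the sync standby
    async_standbys_wait_for_sync_replication = on   # new GUC, default on

With this in place, walsenders whose sync_standby_priority is 0 (the async
physical walsenders and the logical replication walsenders) hold back WAL
until the sync standby's flush LSN catches up, and
SELECT application_name, sync_state FROM pg_stat_replication; on the primary
shows which walsenders are treated as synchronous. Since the GUC is
PGC_SIGHUP, setting it to off and reloading restores the current behaviour.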