Race condition in SyncRepGetSyncStandbysPriority

Started by Tom Lane · about 6 years ago · 38 messages · pgsql-hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

Twice in the past month [1][2], buildfarm member hoverfly has managed
to reach the "unreachable" Assert(false) at the end of
SyncRepGetSyncStandbysPriority.

What seems likely to me, after quickly eyeballing the code, is that
hoverfly is hitting the blatantly-obvious race condition in that function.
Namely, that the second loop supposes that the state of the walsender
array hasn't changed since the first loop.

The minimum fix for this, I suppose, would have the first loop capture
the sync_standby_priority value for each walsender along with what it's
already capturing. But I wonder if the whole function shouldn't be
rewritten from scratch, because it seems like the algorithm is both
expensively brute-force and unintelligible, which is a sad combination.
It's likely that the number of walsenders would never be high enough
that efficiency could matter, but then couldn't we use an algorithm
that is less complicated and more obviously correct? (Because the
alternative conclusion, if you reject the theory that a race is happening,
is that the algorithm is just flat out buggy; something that's not too
easy to disprove either.)

Another fairly dubious thing here is that whether or not *am_sync
gets set depends not only on whether MyWalSnd is claiming to be
synchronous but on how many lower-numbered walsenders are too.
Is that really the right thing?

But worse than any of that is that the return value seems to be
a list of walsender array indexes, meaning that the callers cannot
use it without making even stronger assumptions about the array
contents not having changed since the start of this function.

It sort of looks like the design is based on the assumption that
the array contents can't change while SyncRepLock is held ... but
if that's the plan then why bother with the per-walsender spinlocks?
In any case this assumption seems to be failing, suggesting either
that there's a caller that's not holding SyncRepLock when it calls
this function, or that somebody is failing to take that lock while
modifying the array.

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-02-29%2001%3A34%3A55
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-03-26%2013%3A51%3A15

#2 Fujii Masao
masao.fujii@gmail.com
In reply to: Tom Lane (#1)
Re: Race condition in SyncRepGetSyncStandbysPriority

On 2020/03/27 10:26, Tom Lane wrote:

Twice in the past month [1][2], buildfarm member hoverfly has managed
to reach the "unreachable" Assert(false) at the end of
SyncRepGetSyncStandbysPriority.

When I searched the past discussions, I found that Noah Misch reported
the same issue.
/messages/by-id/20200206074552.GB3326097@rfd.leadboat.com

What seems likely to me, after quickly eyeballing the code, is that
hoverfly is hitting the blatantly-obvious race condition in that function.
Namely, that the second loop supposes that the state of the walsender
array hasn't changed since the first loop.

The minimum fix for this, I suppose, would have the first loop capture
the sync_standby_priority value for each walsender along with what it's
already capturing. But I wonder if the whole function shouldn't be
rewritten from scratch, because it seems like the algorithm is both
expensively brute-force and unintelligible, which is a sad combination.
It's likely that the number of walsenders would never be high enough
that efficiency could matter, but then couldn't we use an algorithm
that is less complicated and more obviously correct?

+1 to rewrite the function with better algorithm.

(Because the
alternative conclusion, if you reject the theory that a race is happening,
is that the algorithm is just flat out buggy; something that's not too
easy to disprove either.)

Another fairly dubious thing here is that whether or not *am_sync
gets set depends not only on whether MyWalSnd is claiming to be
synchronous but on how many lower-numbered walsenders are too.
Is that really the right thing?

But worse than any of that is that the return value seems to be
a list of walsender array indexes, meaning that the callers cannot
use it without making even stronger assumptions about the array
contents not having changed since the start of this function.

It sort of looks like the design is based on the assumption that
the array contents can't change while SyncRepLock is held ... but
if that's the plan then why bother with the per-walsender spinlocks?
In any case this assumption seems to be failing, suggesting either
that there's a caller that's not holding SyncRepLock when it calls
this function, or that somebody is failing to take that lock while
modifying the array.

As far as I read the code, that assumption seems still valid. But the problem
is that each walsender updates MyWalSnd->sync_standby_priority at its own
convenient time, when SIGHUP is signaled. That is, at a certain moment,
some walsenders (and their WalSnd entries in shmem) work based on
the latest configuration while the others (and their WalSnd entries) work
based on the old one.

lowest_priority = SyncRepConfig->nmembers;
next_highest_priority = lowest_priority + 1;

SyncRepGetSyncStandbysPriority() calculates the lowest priority among
all running walsenders as above, using the configuration info that
this walsender is based on. But this calculated lowest priority would be
invalid if another walsender is based on a different (e.g., old) configuration.
This can cause that (other) walsender to have a lower priority than
the calculated lowest priority, and the second loop in
SyncRepGetSyncStandbysPriority() to end unexpectedly.

Therefore, the band-aid fix seems to be to set the lowest priority to a
very large number at the beginning of SyncRepGetSyncStandbysPriority().

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

#3 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#2)
Re: Race condition in SyncRepGetSyncStandbysPriority

At Fri, 27 Mar 2020 13:54:25 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/03/27 10:26, Tom Lane wrote:

Twice in the past month [1][2], buildfarm member hoverfly has managed
to reach the "unreachable" Assert(false) at the end of
SyncRepGetSyncStandbysPriority.

When I search the past discussions, I found that Noah Misch reported
the same issue.
/messages/by-id/20200206074552.GB3326097@rfd.leadboat.com

What seems likely to me, after quickly eyeballing the code, is that
hoverfly is hitting the blatantly-obvious race condition in that function.
Namely, that the second loop supposes that the state of the walsender
array hasn't changed since the first loop.

The minimum fix for this, I suppose, would have the first loop capture
the sync_standby_priority value for each walsender along with what it's
already capturing. But I wonder if the whole function shouldn't be
rewritten from scratch, because it seems like the algorithm is both
expensively brute-force and unintelligible, which is a sad combination.
It's likely that the number of walsenders would never be high enough
that efficiency could matter, but then couldn't we use an algorithm
that is less complicated and more obviously correct?

+1 to rewrite the function with better algorithm.

(Because the
alternative conclusion, if you reject the theory that a race is happening,
is that the algorithm is just flat out buggy; something that's not too
easy to disprove either.)

Another fairly dubious thing here is that whether or not *am_sync
gets set depends not only on whether MyWalSnd is claiming to be
synchronous but on how many lower-numbered walsenders are too.
Is that really the right thing?

But worse than any of that is that the return value seems to be
a list of walsender array indexes, meaning that the callers cannot
use it without making even stronger assumptions about the array
contents not having changed since the start of this function.

It sort of looks like the design is based on the assumption that
the array contents can't change while SyncRepLock is held ... but
if that's the plan then why bother with the per-walsender spinlocks?
In any case this assumption seems to be failing, suggesting either
that there's a caller that's not holding SyncRepLock when it calls
this function, or that somebody is failing to take that lock while
modifying the array.

As far as I read the code, that assumption seems still valid. But the problem
is that each walsender updates MyWalSnd->sync_standby_priority at each
convenient timing, when SIGHUP is signaled. That is, at a certain moment,
some walsenders (also their WalSnd entries in shmem) work based on
the latest configuration but the others (also their WalSnd entries) work based
on the old one.

lowest_priority = SyncRepConfig->nmembers;
next_highest_priority = lowest_priority + 1;

SyncRepGetSyncStandbysPriority() calculates the lowest priority among
all running walsenders as the above, by using the configuration info that
this walsender is based on. But this calculated lowest priority would be
invalid if other walsender is based on different (e.g., old) configuraiton.
This can cause the (other) walsender to have lower priority than
the calculated lowest priority and the second loop in
SyncRepGetSyncStandbysPriority() to unexpectedly end.

Therefore, the band-aid fix seems to be to set the lowest priority to
very large number at the beginning of SyncRepGetSyncStandbysPriority().

Or we could just ignore impossible priorities, treating those walsenders as
non-sync standbys. In any case the confused state is resolved once all
walsenders have loaded the new configuration.

I remember that I posted a band-aid for what may be the same issue.

/messages/by-id/20200207.125251.146972241588695685.horikyota.ntt@gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#2)
Re: Race condition in SyncRepGetSyncStandbysPriority

On Fri, 27 Mar 2020 at 13:54, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/03/27 10:26, Tom Lane wrote:

Twice in the past month [1][2], buildfarm member hoverfly has managed
to reach the "unreachable" Assert(false) at the end of
SyncRepGetSyncStandbysPriority.

When I search the past discussions, I found that Noah Misch reported
the same issue.
/messages/by-id/20200206074552.GB3326097@rfd.leadboat.com

What seems likely to me, after quickly eyeballing the code, is that
hoverfly is hitting the blatantly-obvious race condition in that function.
Namely, that the second loop supposes that the state of the walsender
array hasn't changed since the first loop.

The minimum fix for this, I suppose, would have the first loop capture
the sync_standby_priority value for each walsender along with what it's
already capturing. But I wonder if the whole function shouldn't be
rewritten from scratch, because it seems like the algorithm is both
expensively brute-force and unintelligible, which is a sad combination.
It's likely that the number of walsenders would never be high enough
that efficiency could matter, but then couldn't we use an algorithm
that is less complicated and more obviously correct?

+1 to rewrite the function with better algorithm.

(Because the
alternative conclusion, if you reject the theory that a race is happening,
is that the algorithm is just flat out buggy; something that's not too
easy to disprove either.)

Another fairly dubious thing here is that whether or not *am_sync
gets set depends not only on whether MyWalSnd is claiming to be
synchronous but on how many lower-numbered walsenders are too.
Is that really the right thing?

But worse than any of that is that the return value seems to be
a list of walsender array indexes, meaning that the callers cannot
use it without making even stronger assumptions about the array
contents not having changed since the start of this function.

It sort of looks like the design is based on the assumption that
the array contents can't change while SyncRepLock is held ... but
if that's the plan then why bother with the per-walsender spinlocks?
In any case this assumption seems to be failing, suggesting either
that there's a caller that's not holding SyncRepLock when it calls
this function, or that somebody is failing to take that lock while
modifying the array.

As far as I read the code, that assumption seems still valid. But the problem
is that each walsender updates MyWalSnd->sync_standby_priority at each
convenient timing, when SIGHUP is signaled. That is, at a certain moment,
some walsenders (also their WalSnd entries in shmem) work based on
the latest configuration but the others (also their WalSnd entries) work based
on the old one.

lowest_priority = SyncRepConfig->nmembers;
next_highest_priority = lowest_priority + 1;

SyncRepGetSyncStandbysPriority() calculates the lowest priority among
all running walsenders as the above, by using the configuration info that
this walsender is based on. But this calculated lowest priority would be
invalid if other walsender is based on different (e.g., old) configuraiton.
This can cause the (other) walsender to have lower priority than
the calculated lowest priority and the second loop in
SyncRepGetSyncStandbysPriority() to unexpectedly end.

I have the same understanding. Since sync_standby_priority is
protected by SyncRepLock, these values for each walsender are not
changed between the two loops in SyncRepGetSyncStandbysPriority().
However, as Fujii-san already mentioned, the true lowest priority can
be lower than lowest_priority (nmembers) when only some of the walsenders
have reloaded the configuration, which in turn can be the cause of
leaving entries in the pending list at the end of the function.

Therefore, the band-aid fix seems to be to set the lowest priority to
very large number at the beginning of SyncRepGetSyncStandbysPriority().

I think we can use max_wal_senders. And we can change the second loop
so that we exit from the function if the pending list is empty.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#4)
Re: Race condition in SyncRepGetSyncStandbysPriority

On Tue, 31 Mar 2020 at 23:16, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Fri, 27 Mar 2020 at 13:54, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/03/27 10:26, Tom Lane wrote:

Twice in the past month [1][2], buildfarm member hoverfly has managed
to reach the "unreachable" Assert(false) at the end of
SyncRepGetSyncStandbysPriority.

When I search the past discussions, I found that Noah Misch reported
the same issue.
/messages/by-id/20200206074552.GB3326097@rfd.leadboat.com

What seems likely to me, after quickly eyeballing the code, is that
hoverfly is hitting the blatantly-obvious race condition in that function.
Namely, that the second loop supposes that the state of the walsender
array hasn't changed since the first loop.

The minimum fix for this, I suppose, would have the first loop capture
the sync_standby_priority value for each walsender along with what it's
already capturing. But I wonder if the whole function shouldn't be
rewritten from scratch, because it seems like the algorithm is both
expensively brute-force and unintelligible, which is a sad combination.
It's likely that the number of walsenders would never be high enough
that efficiency could matter, but then couldn't we use an algorithm
that is less complicated and more obviously correct?

+1 to rewrite the function with better algorithm.

(Because the
alternative conclusion, if you reject the theory that a race is happening,
is that the algorithm is just flat out buggy; something that's not too
easy to disprove either.)

Another fairly dubious thing here is that whether or not *am_sync
gets set depends not only on whether MyWalSnd is claiming to be
synchronous but on how many lower-numbered walsenders are too.
Is that really the right thing?

But worse than any of that is that the return value seems to be
a list of walsender array indexes, meaning that the callers cannot
use it without making even stronger assumptions about the array
contents not having changed since the start of this function.

It sort of looks like the design is based on the assumption that
the array contents can't change while SyncRepLock is held ... but
if that's the plan then why bother with the per-walsender spinlocks?
In any case this assumption seems to be failing, suggesting either
that there's a caller that's not holding SyncRepLock when it calls
this function, or that somebody is failing to take that lock while
modifying the array.

As far as I read the code, that assumption seems still valid. But the problem
is that each walsender updates MyWalSnd->sync_standby_priority at each
convenient timing, when SIGHUP is signaled. That is, at a certain moment,
some walsenders (also their WalSnd entries in shmem) work based on
the latest configuration but the others (also their WalSnd entries) work based
on the old one.

lowest_priority = SyncRepConfig->nmembers;
next_highest_priority = lowest_priority + 1;

SyncRepGetSyncStandbysPriority() calculates the lowest priority among
all running walsenders as the above, by using the configuration info that
this walsender is based on. But this calculated lowest priority would be
invalid if other walsender is based on different (e.g., old) configuraiton.
This can cause the (other) walsender to have lower priority than
the calculated lowest priority and the second loop in
SyncRepGetSyncStandbysPriority() to unexpectedly end.

I have the same understanding. Since sync_standby_priroity is
protected by SyncRepLock these values of each walsender are not
changed through two loops in SyncRepGetSyncStandbysPriority().
However, as Fujii-san already mentioned the true lowest priority can
be lower than lowest_priority, nmembers, when only part of walsenders
reloaded the configuration, which in turn could be the cause of
leaving entries in the pending list at the end of the function.

Therefore, the band-aid fix seems to be to set the lowest priority to
very large number at the beginning of SyncRepGetSyncStandbysPriority().

I think we can use max_wal_senders.

Sorry, that's not true. We need another number large enough.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Masahiko Sawada (#5)
Re: Race condition in SyncRepGetSyncStandbysPriority

Masahiko Sawada <masahiko.sawada@2ndquadrant.com> writes:

On Tue, 31 Mar 2020 at 23:16, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

Therefore, the band-aid fix seems to be to set the lowest priority to
very large number at the beginning of SyncRepGetSyncStandbysPriority().

I think we can use max_wal_senders.

Sorry, that's not true. We need another number large enough.

The buildfarm had another three failures of this type today, so that
motivated me to look at it some more. I don't think this code needs
a band-aid fix; I think "nuke from orbit" is more nearly the right
level of response.

The point that I was trying to make originally is that it seems quite
insane to imagine that a walsender's sync_standby_priority value is
somehow more stable than the very existence of the process. Yet we
only require a walsender to lock its own mutex while claiming or
disowning its WalSnd entry (by setting or clearing the pid field).
So I think it's nuts to define those fields as being protected by
the global SyncRepLock.

Even without considering the possibility that a walsender has just
started or stopped, we have the problem Fujii-san described that after
a change in the synchronous_standby_names GUC setting, different
walsenders will update their values of sync_standby_priority at
different instants. (BTW, I now notice that Noah had previously
identified this problem at [1].)

Thus, even while holding SyncRepLock, we do not have a guarantee that
we'll see a consistent set of sync_standby_priority values. In fact
we don't even know that the walsnd array entries still belong to the
processes that last set those values. This is what is breaking
SyncRepGetSyncStandbysPriority, and what it means is that there's
really fundamentally no chance of that function producing trustworthy
results. The "band aid" fixes discussed here might avoid crashing on
the Assert, but they won't fix the problems that (a) the result is
possibly wrong and (b) it can become stale immediately even if it's
right when returned.

Now, there are only two callers of SyncRepGetSyncStandbys:
SyncRepGetSyncRecPtr and pg_stat_get_wal_senders. The latter is
mostly cosmetic (which is a good thing, because to add insult to
injury, it continues to use the list after releasing SyncRepLock;
not that continuing to hold that lock would make things much safer).
If I'm reading the code correctly, the former doesn't really care
exactly which walsenders are sync standbys: all it cares about is
to collect their WAL position pointers.

What I think we should do about this is, essentially, to get rid of
SyncRepGetSyncStandbys. Instead, let's have each walsender advertise
whether *it* believes that it is a sync standby, based on its last
evaluation of the relevant GUCs. This would be a bool that it'd
compute and set alongside sync_standby_priority. (Hm, maybe we'd not
even need to have that field anymore? Not sure.) We should also
redefine that flag, and sync_standby_priority if it survives, as being
protected by the per-walsender mutex not SyncRepLock. Then, what
SyncRepGetSyncRecPtr would do is just sweep through the walsender
array and collect WAL position pointers from the walsenders that
claim to be sync standbys at the instant that it's inspecting them.
pg_stat_get_wal_senders could also use those flags instead of the
list from SyncRepGetSyncStandbys.

It's likely that this definition would have slightly different
behavior from the current implementation during the period where
the system is absorbing a change in the set of synchronous
walsenders. However, since the current implementation is visibly
wrong during that period anyway, I'm not sure how this could be
worse. And at least we can be certain that SyncRepGetSyncRecPtr
will not include WAL positions from already-dead walsenders in
its calculations, which *is* a hazard in the code as it stands.

I also estimate that this would be noticeably more efficient than
the current code, since the logic to decide who's a sync standby
would only run when we're dealing with walsender start/stop or
SIGHUP, rather than every time SyncRepGetSyncRecPtr runs.

Don't especially want to code this myself, though. Anyone?

regards, tom lane

[1]: /messages/by-id/20200206074552.GB3326097@rfd.leadboat.com

#7 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#6)
Re: Race condition in SyncRepGetSyncStandbysPriority

At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Masahiko Sawada <masahiko.sawada@2ndquadrant.com> writes:

On Tue, 31 Mar 2020 at 23:16, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

Therefore, the band-aid fix seems to be to set the lowest priority to
very large number at the beginning of SyncRepGetSyncStandbysPriority().

I think we can use max_wal_senders.

Sorry, that's not true. We need another number large enough.

The buildfarm had another three failures of this type today, so that
motivated me to look at it some more. I don't think this code needs
a band-aid fix; I think "nuke from orbit" is more nearly the right
level of response.

The point that I was trying to make originally is that it seems quite
insane to imagine that a walsender's sync_standby_priority value is
somehow more stable than the very existence of the process. Yet we
only require a walsender to lock its own mutex while claiming or
disowning its WalSnd entry (by setting or clearing the pid field).
So I think it's nuts to define those fields as being protected by
the global SyncRepLock.

Right. FWIW, furthermore, even SyncRepConfig->syncrep_method can be
inconsistent among walsenders. I haven't thought that it can be
relied on as always consistent and it is enough that it makes a
consistent result only while the setting and the set of walsenders is
stable.

Even without considering the possibility that a walsender has just
started or stopped, we have the problem Fujii-san described that after
a change in the synchronous_standby_names GUC setting, different
walsenders will update their values of sync_standby_priority at
different instants. (BTW, I now notice that Noah had previously
identified this problem at [1].)

Thus, even while holding SyncRepLock, we do not have a guarantee that
we'll see a consistent set of sync_standby_priority values. In fact
we don't even know that the walsnd array entries still belong to the
processes that last set those values. This is what is breaking
SyncRepGetSyncStandbysPriority, and what it means is that there's
really fundamentally no chance of that function producing trustworthy
results. The "band aid" fixes discussed here might avoid crashing on
the Assert, but they won't fix the problems that (a) the result is
possibly wrong and (b) it can become stale immediately even if it's
right when returned.

Agreed. And I had thought that it's not a problem if we have a wrong result
temporarily. The instability persists for at most the standby-reply
interval (unless the next cause of instability arrives).

Now, there are only two callers of SyncRepGetSyncStandbys:
SyncRepGetSyncRecPtr and pg_stat_get_wal_senders. The latter is
mostly cosmetic (which is a good thing, because to add insult to
injury, it continues to use the list after releasing SyncRepLock;
not that continuing to hold that lock would make things much safer).
If I'm reading the code correctly, the former doesn't really care
exactly which walsenders are sync standbys: all it cares about is
to collect their WAL position pointers.

Agreed. To find the sync standby with the largest delay.

What I think we should do about this is, essentially, to get rid of
SyncRepGetSyncStandbys. Instead, let's have each walsender advertise
whether *it* believes that it is a sync standby, based on its last
evaluation of the relevant GUCs. This would be a bool that it'd
compute and set alongside sync_standby_priority. (Hm, maybe we'd not

Mmm.. SyncRepGetStandbyPriority returns the "priority" that a
walsender thinks it is at, among synchronous_standby_names. Then to
decide "I am a sync standby" we need to know how many walsenders with
higher priority are alive now. SyncRepGetSyncStandbyPriority does the
judgment now and suffers from the inconsistency of priority values.

In short, it seems to me that this just moves the problem to another
place. But I think that there might be a smarter way to find "I am
sync".

even need to have that field anymore? Not sure.) We should also
redefine that flag, and sync_standby_priority if it survives, as being
protected by the per-walsender mutex not SyncRepLock. Then, what
SyncRepGetSyncRecPtr would do is just sweep through the walsender
array and collect WAL position pointers from the walsenders that
claim to be sync standbys at the instant that it's inspecting them.
pg_stat_get_wal_senders could also use those flags instead of the
list from SyncRepGetSyncStandbys.

It's likely that this definition would have slightly different
behavior from the current implementation during the period where
the system is absorbing a change in the set of synchronous
walsenders. However, since the current implementation is visibly
wrong during that period anyway, I'm not sure how this could be
worse. And at least we can be certain that SyncRepGetSyncRecPtr
will not include WAL positions from already-dead walsenders in
its calculations, which *is* a hazard in the code as it stands.

I also estimate that this would be noticeably more efficient than
the current code, since the logic to decide who's a sync standby
would only run when we're dealing with walsender start/stop or
SIGHUP, rather than every time SyncRepGetSyncRecPtr runs.

Don't especially want to code this myself, though. Anyone?

regards, tom lane

[1] /messages/by-id/20200206074552.GB3326097@rfd.leadboat.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#7)
Re: Race condition in SyncRepGetSyncStandbysPriority

At Mon, 13 Apr 2020 15:31:01 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

The point that I was trying to make originally is that it seems quite
insane to imagine that a walsender's sync_standby_priority value is
somehow more stable than the very existence of the process. Yet we
only require a walsender to lock its own mutex while claiming or
disowning its WalSnd entry (by setting or clearing the pid field).
So I think it's nuts to define those fields as being protected by
the global SyncRepLock.

Right. FWIW, furthermore, even SyncRepConfig->syncrep_method can be
inconsistent among walsenders. I haven't thought that it can be
relied on as always consistent and it is enough that it makes a
consistent result only while the setting and the set of walsenders is
stable.

Yes, the sentence "and (I haven't thought that) it is enough .." is a
mistake for "and I have thought that it is enough that ...".

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kyotaro Horiguchi (#7)
Re: Race condition in SyncRepGetSyncStandbysPriority

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

What I think we should do about this is, essentially, to get rid of
SyncRepGetSyncStandbys. Instead, let's have each walsender advertise
whether *it* believes that it is a sync standby, based on its last
evaluation of the relevant GUCs. This would be a bool that it'd
compute and set alongside sync_standby_priority. (Hm, maybe we'd not

Mmm.. SyncRepGetStandbyPriority returns the "priority" that a
walsender thinks it is at, among synchronous_standby_names. Then to
decide "I am a sync standby" we need to know how many walsenders with
higher priority are alive now. SyncRepGetSyncStandbyPriority does the
judgment now and suffers from the inconsistency of priority values.

Yeah. After looking a bit closer, I think that the current definition
of sync_standby_priority (that is, as the result of local evaluation
of SyncRepGetStandbyPriority()) is OK. The problem is what we're doing
with it. I suggest that what we should do in SyncRepGetSyncRecPtr()
is make one sweep across the WalSnd array, collecting PID,
sync_standby_priority, *and* the WAL pointers from each valid entry.
Then examine that data and decide which WAL value we need, without assuming
that the sync_standby_priority values are necessarily totally consistent.
But in any case we must examine each entry just once while holding its
mutex, not go back to it later expecting it to still be the same.

Another thing that I'm finding interesting is that I now see this is
not at all new code. It doesn't look like SyncRepGetSyncStandbysPriority
has changed much since 2016. So how come we didn't detect this problem
long ago? I searched the buildfarm logs for assertion failures in
syncrep.c, looking back one year, and here's what I found:

sysname | branch | snapshot | stage | l
------------+---------------+---------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------
nightjar | REL_10_STABLE | 2019-08-13 23:04:41 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "/pgbuild/root/REL_10_STABLE/pgsql.build/../pgsql/src/backend/replication/syncrep.c", Line: 940)
hoverfly | REL9_6_STABLE | 2019-11-07 17:19:12 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
hoverfly | HEAD | 2019-11-22 12:15:08 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
francolin | HEAD | 2020-01-16 23:10:06 | recoveryCheck | TRAP: FailedAssertion("false", File: "/home/andres/build/buildfarm-francolin/HEAD/pgsql.build/../pgsql/src/backend/replication/syncrep.c", Line: 951)
hoverfly | REL_11_STABLE | 2020-02-29 01:34:55 | recoveryCheck | TRAP: FailedAssertion("!(0)", File: "syncrep.c", Line: 946)
hoverfly | REL9_6_STABLE | 2020-03-26 13:51:15 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
hoverfly | REL9_6_STABLE | 2020-04-07 21:52:07 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
curculio | HEAD | 2020-04-11 18:30:21 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
sidewinder | HEAD | 2020-04-11 18:45:39 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
curculio | HEAD | 2020-04-11 20:30:26 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
sidewinder | HEAD | 2020-04-11 21:45:48 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
sidewinder | HEAD | 2020-04-13 10:45:35 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
conchuela | HEAD | 2020-04-13 16:00:18 | recoveryCheck | TRAP: FailedAssertion("false", File: "/home/pgbf/buildroot/HEAD/pgsql.build/../pgsql/src/backend/replication/syncrep.c", Line: 951)
sidewinder | HEAD | 2020-04-13 18:45:34 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
(14 rows)

The line numbers vary in the back branches, but all of these crashes are
at that same Assert. So (a) yes, this does happen in the back branches,
but (b) some fairly recent change has made it a whole lot more probable.
Neither syncrep.c nor 007_sync_rep.pl have changed much in some time,
so whatever the change was was indirect. Curious. Is it just timing?

I'm giving the side-eye to Noah's recent changes 328c70997 and 421685812,
but this isn't enough evidence to say definitely that that's what boosted
the failure rate.

regards, tom lane

#10 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#9)
Re: Race condition in SyncRepGetSyncStandbysPriority

On Tue, 14 Apr 2020 at 10:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

What I think we should do about this is, essentially, to get rid of
SyncRepGetSyncStandbys. Instead, let's have each walsender advertise
whether *it* believes that it is a sync standby, based on its last
evaluation of the relevant GUCs. This would be a bool that it'd
compute and set alongside sync_standby_priority. (Hm, maybe we'd not

Mmm.. SyncRepGetStandbyPriority returns the "priority" that a
walsender thinks it is at, among synchronous_standby_names. Then to
decide "I am a sync standby" we need to know how many walsenders with
higher priority are alive now. SyncRepGetSyncStandbyPriority does the
judgment now and suffers from the inconsistency of priority values.

Yeah. After looking a bit closer, I think that the current definition
of sync_standby_priority (that is, as the result of local evaluation
of SyncRepGetStandbyPriority()) is OK. The problem is what we're doing
with it. I suggest that what we should do in SyncRepGetSyncRecPtr()
is make one sweep across the WalSnd array, collecting PID,
sync_standby_priority, *and* the WAL pointers from each valid entry.
Then examine that data and decide which WAL value we need, without assuming
that the sync_standby_priority values are necessarily totally consistent.
But in any case we must examine each entry just once while holding its
mutex, not go back to it later expecting it to still be the same.

Can we have a similar approach to sync_standby_defined for
sync_standby_priority? That is, the checkpointer is responsible for
changing sync_standby_priority of all walsenders on SIGHUP. That
way, all walsenders can see a consistent view of
sync_standby_priority. And when a walsender starts, it sets
sync_standby_priority by itself. The logic to decide who's a sync
standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders
having higher priority along with their WAL positions.

Another thing that I'm finding interesting is that I now see this is
not at all new code. It doesn't look like SyncRepGetSyncStandbysPriority
has changed much since 2016. So how come we didn't detect this problem
long ago? I searched the buildfarm logs for assertion failures in
syncrep.c, looking back one year, and here's what I found:

sysname | branch | snapshot | stage | l
------------+---------------+---------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------
nightjar | REL_10_STABLE | 2019-08-13 23:04:41 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "/pgbuild/root/REL_10_STABLE/pgsql.build/../pgsql/src/backend/replication/syncrep.c", Line: 940)
hoverfly | REL9_6_STABLE | 2019-11-07 17:19:12 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
hoverfly | HEAD | 2019-11-22 12:15:08 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
francolin | HEAD | 2020-01-16 23:10:06 | recoveryCheck | TRAP: FailedAssertion("false", File: "/home/andres/build/buildfarm-francolin/HEAD/pgsql.build/../pgsql/src/backend/replication/syncrep.c", Line: 951)
hoverfly | REL_11_STABLE | 2020-02-29 01:34:55 | recoveryCheck | TRAP: FailedAssertion("!(0)", File: "syncrep.c", Line: 946)
hoverfly | REL9_6_STABLE | 2020-03-26 13:51:15 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
hoverfly | REL9_6_STABLE | 2020-04-07 21:52:07 | recoveryCheck | TRAP: FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
curculio | HEAD | 2020-04-11 18:30:21 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
sidewinder | HEAD | 2020-04-11 18:45:39 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
curculio | HEAD | 2020-04-11 20:30:26 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
sidewinder | HEAD | 2020-04-11 21:45:48 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
sidewinder | HEAD | 2020-04-13 10:45:35 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
conchuela | HEAD | 2020-04-13 16:00:18 | recoveryCheck | TRAP: FailedAssertion("false", File: "/home/pgbf/buildroot/HEAD/pgsql.build/../pgsql/src/backend/replication/syncrep.c", Line: 951)
sidewinder | HEAD | 2020-04-13 18:45:34 | recoveryCheck | TRAP: FailedAssertion("false", File: "syncrep.c", Line: 951)
(14 rows)

The line numbers vary in the back branches, but all of these crashes are
at that same Assert. So (a) yes, this does happen in the back branches,
but (b) some fairly recent change has made it a whole lot more probable.
Neither syncrep.c nor 007_sync_rep.pl have changed much in some time,
so whatever the change was was indirect. Curious. Is it just timing?

Interesting. It's happening on certain animals, not all. Especially
tests with HEAD on sidewinder and curculio, which are NetBSD 7 and
OpenBSD 5.9 respectively, started to fail at a high rate since a
couple of days ago.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Masahiko Sawada (#10)
Re: Race condition in SyncRepGetSyncStandbysPriority

At Tue, 14 Apr 2020 13:06:14 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in

On Tue, 14 Apr 2020 at 10:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

What I think we should do about this is, essentially, to get rid of
SyncRepGetSyncStandbys. Instead, let's have each walsender advertise
whether *it* believes that it is a sync standby, based on its last
evaluation of the relevant GUCs. This would be a bool that it'd
compute and set alongside sync_standby_priority. (Hm, maybe we'd not

Mmm.. SyncRepGetStandbyPriority returns the "priority" that a
walsender thinks it is at, among synchronous_standby_names. Then to
decide "I am a sync standby" we need to know how many walsenders with
higher priority are alive now. SyncRepGetSyncStandbyPriority does the
judgment now and suffers from the inconsistency of priority values.

Yeah. After looking a bit closer, I think that the current definition
of sync_standby_priority (that is, as the result of local evaluation
of SyncRepGetStandbyPriority()) is OK. The problem is what we're doing
with it. I suggest that what we should do in SyncRepGetSyncRecPtr()
is make one sweep across the WalSnd array, collecting PID,
sync_standby_priority, *and* the WAL pointers from each valid entry.
Then examine that data and decide which WAL value we need, without assuming
that the sync_standby_priority values are necessarily totally consistent.
But in any case we must examine each entry just once while holding its
mutex, not go back to it later expecting it to still be the same.

SyncRepGetSyncStandbysPriority() runs holding SyncRepLock, so
sync_standby_priority of any walsender cannot be changed while the
function is scanning walsenders. The issue is we already have
inconsistent walsender information before we enter the function. Thus
how many times we scan the array doesn't make any difference.

I think we need to do one of the following:

A) prevent SyncRepGetSyncStandbysPriority from being entered while
walsender priority is inconsistent.

B) make SyncRepGetSyncStandbysPriority tolerant of priority
inconsistency.

C) protect the walsender priority array from being inconsistent.

(B) is a band-aid. To achieve (A) we need a central controller of
priority config handling. (C) is:

Can we have a similar approach to sync_standby_defined for
sync_standby_priority? That is, the checkpointer is responsible for
changing sync_standby_priority of all walsenders on SIGHUP. That
way, all walsenders can see a consistent view of
sync_standby_priority. And when a walsender starts, it sets
sync_standby_priority by itself. The logic to decide who's a sync
standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders
having higher priority along with their WAL positions.

Yeah, it works if we do that, but the problem with that way is that to
determine the priority of walsenders, we need to know which walsenders
are running. That is, when a new walsender arrives (or leaves), the
process needs to be aware of it right away and reassign the priority of
every walsender again.

If we accept sharing variable-length information among processes,
sharing sync_standby_names or the parsed SyncRepConfigData among
processes would work.

Another thing that I'm finding interesting is that I now see this is
not at all new code. It doesn't look like SyncRepGetSyncStandbysPriority
has changed much since 2016. So how come we didn't detect this problem
long ago? I searched the buildfarm logs for assertion failures in
syncrep.c, looking back one year, and here's what I found:

...

The line numbers vary in the back branches, but all of these crashes are
at that same Assert. So (a) yes, this does happen in the back branches,
but (b) some fairly recent change has made it a whole lot more probable.
Neither syncrep.c nor 007_sync_rep.pl have changed much in some time,
so whatever the change was was indirect. Curious. Is it just timing?

Interesting. It's happening on certain animals, not all. Especially
tests with HEAD on sidewinder and curculio, which are NetBSD 7 and
OpenBSD 5.9 respectively, started to fail at a high rate since a
couple of days ago.

Couldn't this align the timing of config reloading? (I haven't checked
anything yet.)

| commit 421685812290406daea58b78dfab0346eb683bbb
| Author: Noah Misch <noah@leadboat.com>
| Date: Sat Apr 11 10:30:00 2020 -0700
|
| When WalSndCaughtUp, sleep only in WalSndWaitForWal().
| Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write
| < sentPtr. That is important in logical replication. When the latest
| physical LSN yields no logical replication messages (a common case),
| that keepalive elicits a reply, and processing the reply updates
| pg_stat_replication.replay_lsn. WalSndLoop() lacks that; when
| WalSndLoop() slept, replay_lsn advancement could stall until
| wal_receiver_status_interval elapsed. This sometimes stalled
| src/test/subscription/t/001_rep_changes.pl for up to 10s.
|
| Discussion: /messages/by-id/20200406063649.GA3738151@rfd.leadboat.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kyotaro Horiguchi (#11)
Re: Race condition in SyncRepGetSyncStandbysPriority

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

SyncRepGetSyncStandbysPriority() runs holding SyncRepLock, so
sync_standby_priority of any walsender cannot be changed while the
function is scanning walsenders. The issue is we already have
inconsistent walsender information before we enter the function. Thus
how many times we scan the array doesn't make any difference.

*Yes it does*. The existing code can deliver entirely broken results
if some walsender exits between where we examine the priorities and
where we fetch the WAL pointers. While that doesn't seem to be the
exact issue we're seeing in the buildfarm, it's still another obvious
bug in this code. I will not accept a "fix" that doesn't fix that.

I think we need to do one of the following:

A) prevent SyncRepGetSyncStandbysPriority from being entered while
walsender priority is inconsistent.
B) make SyncRepGetSyncStandbysPriority tolerant of priority
inconsistency.
C) protect the walsender priority array from being inconsistent.

(B) seems like the only practical solution from here. We could
probably arrange for synchronous update of the priorities when
they change in response to a GUC change, but it doesn't seem to
me to be practical to do that in response to walsender exit.
You'd end up finding that an unexpected walsender exit results
in panic'ing the system, which is no better than where we are now.

It doesn't seem to me to be that hard to implement the desired
semantics for synchronous_standby_names with inconsistent info.
In FIRST mode you basically just need to take the N smallest
priorities you see in the array, but without assuming there are no
duplicates or holes. It might be a good idea to include ties at the
end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync
standbys, include the first three of them in the calculation until
the inconsistency is resolved. In ANY mode I don't see that
inconsistent priorities matter at all.

If we accept sharing variable-length information among processes,
sharing sync_standby_names or the parsed SyncRepConfigData among
processes would work.

Not sure that we really need more than what's being shared now,
ie each process's last-known index in the sync_standby_names list.

regards, tom lane

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#12)
Re: Race condition in SyncRepGetSyncStandbysPriority

I wrote:

It doesn't seem to me to be that hard to implement the desired
semantics for synchronous_standby_names with inconsistent info.
In FIRST mode you basically just need to take the N smallest
priorities you see in the array, but without assuming there are no
duplicates or holes. It might be a good idea to include ties at the
end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync
standbys, include the first three of them in the calculation until
the inconsistency is resolved. In ANY mode I don't see that
inconsistent priorities matter at all.

Concretely, I think we ought to do the attached, or something pretty
close to it.

I'm not really happy about breaking ties based on walsnd_index,
but I see that there are several TAP test cases that fail if we
do something else. I'm inclined to think those tests are bogus ...
but I won't argue to change them right now.

regards, tom lane

Attachments:

syncrep-fixes-1.patch (text/x-diff, +200/-281)
#14 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#12)
Re: Race condition in SyncRepGetSyncStandbysPriority

At Tue, 14 Apr 2020 09:52:42 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

SyncRepGetSyncStandbysPriority() runs holding SyncRepLock, so
sync_standby_priority of any walsender cannot be changed while the
function is scanning walsenders. The issue is we already have
inconsistent walsender information before we enter the function. Thus
how many times we scan the array doesn't make any difference.

*Yes it does*. The existing code can deliver entirely broken results
if some walsender exits between where we examine the priorities and
where we fetch the WAL pointers. While that doesn't seem to be the

Ah. I didn't take that as inconsistency. Actually, a walsender exit
inactivates the corresponding slot by setting pid = 0. In a bad case
(as you mentioned upthread) the entry can be occupied by another
walsender. However, sync_standby_priority cannot be updated until the
whole work is finished.

exact issue we're seeing in the buildfarm, it's still another obvious
bug in this code. I will not accept a "fix" that doesn't fix that.

I think that the "inconsistency" that can be observed in a process is
a disagreement between SyncRepConfig->nmembers and
<each_walsnd_entry>->sync_standby_priority. If any one of the walsenders
regards its priority as lower (larger in value) than nmembers in the
"current" process, the assertion fires. If that is the issue, the
issue is not dynamic inconsistency.

# That's the assumption behind my band-aid.

I think we need to do one of the following:

A) prevent SyncRepGetSyncStandbysPriority from being entered while
walsender priority is inconsistent.
B) make SyncRepGetSyncStandbysPriority tolerant of priority
inconsistency.
C) protect the walsender priority array from being inconsistent.

(B) seems like the only practical solution from here. We could
probably arrange for synchronous update of the priorities when
they change in response to a GUC change, but it doesn't seem to
me to be practical to do that in response to walsender exit.
You'd end up finding that an unexpected walsender exit results
in panic'ing the system, which is no better than where we are now.

I agree with you on the whole. I thought of several ways to keep the
array consistent, but all of them looked like too much.

It doesn't seem to me to be that hard to implement the desired
semantics for synchronous_standby_names with inconsistent info.
In FIRST mode you basically just need to take the N smallest
priorities you see in the array, but without assuming there are no
duplicates or holes. It might be a good idea to include ties at the
end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync
standbys, include the first three of them in the calculation until
the inconsistency is resolved. In ANY mode I don't see that
inconsistent priorities matter at all.

Mmm, priority lists like 1,2,2,4 are not regarded as inconsistent at
all in the context of walsender priority. That happens stably if two
or more walreceivers report the same application_name. I believe the
existing code already takes that case into consideration.

If we accept to share variable-length information among processes,
sharing sync_standby_names or parsed SyncRepConfigData among processes
would work.

Not sure that we really need more than what's being shared now,
ie each process's last-known index in the sync_standby_names list.

If we take (B), we don't need anything more. (A) and (C) would need
more.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#15 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#13)
Re: Race condition in SyncRepGetSyncStandbysPriority

At Tue, 14 Apr 2020 16:32:40 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

I wrote:

It doesn't seem to me to be that hard to implement the desired
semantics for synchronous_standby_names with inconsistent info.
In FIRST mode you basically just need to take the N smallest
priorities you see in the array, but without assuming there are no
duplicates or holes. It might be a good idea to include ties at the
end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync
standbys, include the first three of them in the calculation until
the inconsistency is resolved. In ANY mode I don't see that
inconsistent priorities matter at all.

Concretely, I think we ought to do the attached, or something pretty
close to it.

Looking at SyncRepGetSyncStandbys, I agree that it's good not to assume
lowest_priority, which I thought was the culprit of the assertion
failure. The current code intends to use less memory. I don't think
there is a case where only 3 out of 1000 standbys are required to be
sync standbys, so collecting all walsenders and then sorting them seems
a reasonable strategy. The new code looks clearer.

+ stby->is_sync_standby = true; /* might change below */

I'm uneasy with that. In quorum mode all running standbys are marked
as "sync" and that's bogus.

The only users of the flag seem to be:

SyncRepGetSyncRecPtr:
+ *am_sync = sync_standbys[i].is_sync_standby;

and

SyncRepGetOldestSyncRecPtr:
+		/* Ignore candidates that aren't considered synchronous */
+		if (!sync_standbys[i].is_sync_standby)
+			continue;

On the other hand, sync_standbys is already sorted in priority order, so
I think we can get rid of the member by setting *am_sync as follows.

SyncRepGetSyncRecPtr:
if (sync_standbys[i].is_me)
{
*am_sync = (i < SyncRepConfig->num_sync);
break;
}

And the second user can be as follows.

SyncRepGetOldestSyncRecPtr:
/* Ignore candidates that aren't considered synchronous */
if (i >= SyncRepConfig->num_sync)
break;

I'm not really happy about breaking ties based on walsnd_index,
but I see that there are several TAP test cases that fail if we
do something else. I'm inclined to think those tests are bogus ...
but I won't argue to change them right now.

Agreed about the tie-breaker.

I'm looking at this more closely.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#16 Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#13)
Re: Race condition in SyncRepGetSyncStandbysPriority

On Tue, Apr 14, 2020 at 04:32:40PM -0400, Tom Lane wrote:

I wrote:

It doesn't seem to me to be that hard to implement the desired
semantics for synchronous_standby_names with inconsistent info.
In FIRST mode you basically just need to take the N smallest
priorities you see in the array, but without assuming there are no
duplicates or holes. It might be a good idea to include ties at the
end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync
standbys, include the first three of them in the calculation until
the inconsistency is resolved. In ANY mode I don't see that
inconsistent priorities matter at all.

Concretely, I think we ought to do the attached, or something pretty
close to it.

I'm not really happy about breaking ties based on walsnd_index,
but I see that there are several TAP test cases that fail if we
do something else. I'm inclined to think those tests are bogus ...
but I won't argue to change them right now.

This passes the test battery I wrote in preparation for the 2020-02 thread.

#17 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Kyotaro Horiguchi (#11)
Re: Race condition in SyncRepGetSyncStandbysPriority

On Tue, 14 Apr 2020 at 18:35, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Tue, 14 Apr 2020 13:06:14 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in

On Tue, 14 Apr 2020 at 10:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

What I think we should do about this is, essentially, to get rid of
SyncRepGetSyncStandbys. Instead, let's have each walsender advertise
whether *it* believes that it is a sync standby, based on its last
evaluation of the relevant GUCs. This would be a bool that it'd
compute and set alongside sync_standby_priority. (Hm, maybe we'd not

Mmm.. SyncRepGetStandbyPriority returns the "priority" that a
walsender thinks it is at, among synchronous_standby_names. Then to
decide "I am a sync standby" we need to know how many walsenders with
higher priority are alive now. SyncRepGetSyncStandbyPriority does the
judgment now and suffers from the inconsistency of priority values.

Yeah. After looking a bit closer, I think that the current definition
of sync_standby_priority (that is, as the result of local evaluation
of SyncRepGetStandbyPriority()) is OK. The problem is what we're doing
with it. I suggest that what we should do in SyncRepGetSyncRecPtr()
is make one sweep across the WalSnd array, collecting PID,
sync_standby_priority, *and* the WAL pointers from each valid entry.
Then examine that data and decide which WAL value we need, without assuming
that the sync_standby_priority values are necessarily totally consistent.
But in any case we must examine each entry just once while holding its
mutex, not go back to it later expecting it to still be the same.

SyncRepGetSyncStandbysPriority() runs holding SyncRepLock, so
sync_standby_priority of any walsender cannot be changed while the
function is scanning walsenders. The issue is we already have
inconsistent walsender information before we enter the function. Thus
how many times we scan the array doesn't make any difference.

I think we need to do one of the following:

A) prevent SyncRepGetSyncStandbysPriority from being entered while
walsender priority is inconsistent.

B) make SyncRepGetSyncStandbysPriority tolerant of priority
inconsistency.

C) protect the walsender priority array from being inconsistent.

(B) is a band-aid. To achieve (A) we need a central controller of
priority config handling. (C) is:

Can we have a similar approach to sync_standby_defined for
sync_standby_priority? That is, the checkpointer is responsible for
changing sync_standby_priority of all walsenders on SIGHUP. That
way, all walsenders can see a consistent view of
sync_standby_priority. And when a walsender starts, it sets
sync_standby_priority by itself. The logic to decide who's a sync
standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders
having higher priority along with their WAL positions.

Yeah, it works if we do that, but the problem with that way is that to
determine the priority of walsenders, we need to know which walsenders
are running. That is, when a new walsender arrives (or leaves), the
process needs to be aware of it right away and reassign the priority of
every walsender again.

I think we don't need to reassign the priority when a walsender
comes or leaves. IIUC, the priority is calculated based only on
synchronous_standby_names. A walsender coming or leaving doesn't
affect others' priorities.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#15)
Re: Race condition in SyncRepGetSyncStandbysPriority

At Wed, 15 Apr 2020 11:35:58 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

I'm looking this more closer.

This looks to me to be in the right direction.

As mentioned in the previous mail, I removed is_sync_standby from
SyncRepStandbyData. But just doing that breaks pg_stat_get_wal_senders.
It is an existing issue, but the logic for sync_state (values[10]) looks
odd. Fixed in the attached.

SyncRepInitConfig uses the mutex instead of SyncRepLock. Since the
integrity of sync_standby_priority is not guaranteed anyway, it seems
OK to me. It seems fine to remove the assertion and the requirement
about SyncRepLock from SyncRepGetSyncRecPtr for the same reason.
(Actually the lock is held, though.)

SyncRepGetSyncStandbysPriority doesn't seem worth existing as a
function. Removed in the attached.

+ num_standbys = SyncRepGetSyncStandbys(&sync_standbys);

The list no longer consists only of synchronous standbys. I changed
the function name and variable name, and tried to adjust the related
comments.

It's not something the patch did, but I don't understand why
SyncRepGetNthLatestSyncRecPtr takes SyncRepConfig->num_sync while
SyncRepGetOldest.. accesses it directly. I changed the *Oldest*
function in the attached. I didn't go further, but eventually the two
functions could be consolidated, just by moving the selection logic
currently in SyncRepGetSyncRecPtr into the new function.

The resulting patch is attached.

- removed is_sync_standby from SyncRepStandbyData
- Fixed the logic for values[10] in pg_stat_get_wal_senders
- Changed the signature of SyncRepGetOldestSyncRecPtr
- Adjusted some comments to match the behavioral change of
SyncRepGet(Sync)Standbys.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

syncrep-fixes-2.patch (text/x-patch, +192/-297)
#19 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Masahiko Sawada (#17)
Re: Race condition in SyncRepGetSyncStandbysPriority

At Wed, 15 Apr 2020 13:01:02 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in

On Tue, 14 Apr 2020 at 18:35, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Tue, 14 Apr 2020 13:06:14 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in

On Tue, 14 Apr 2020 at 10:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

What I think we should do about this is, essentially, to get rid of
SyncRepGetSyncStandbys. Instead, let's have each walsender advertise
whether *it* believes that it is a sync standby, based on its last
evaluation of the relevant GUCs. This would be a bool that it'd
compute and set alongside sync_standby_priority. (Hm, maybe we'd not

Mmm.. SyncRepGetStandbyPriority returns the "priority" that a
walsender thinks it is at, among synchronous_standby_names. Then to
decide "I am a sync standby" we need to know how many walsenders with
higher priority are alive now. SyncRepGetSyncStandbyPriority does the
judgment now and suffers from the inconsistency of priority values.

Yeah. After looking a bit closer, I think that the current definition
of sync_standby_priority (that is, as the result of local evaluation
of SyncRepGetStandbyPriority()) is OK. The problem is what we're doing
with it. I suggest that what we should do in SyncRepGetSyncRecPtr()
is make one sweep across the WalSnd array, collecting PID,
sync_standby_priority, *and* the WAL pointers from each valid entry.
Then examine that data and decide which WAL value we need, without assuming
that the sync_standby_priority values are necessarily totally consistent.
But in any case we must examine each entry just once while holding its
mutex, not go back to it later expecting it to still be the same.

SyncRepGetSyncStandbysPriority() runs holding SyncRepLock, so
sync_standby_priority of any walsender cannot be changed while the
function is scanning walsenders. The issue is we already have
inconsistent walsender information before we enter the function. Thus
how many times we scan the array doesn't make any difference.

I think we need to do one of the following:

A) prevent SyncRepGetSyncStandbysPriority from being entered while
walsender priorities are inconsistent.

B) make SyncRepGetSyncStandbysPriority tolerant of priority
inconsistency.

C) protect the walsender priority array from becoming inconsistent.

(B) is a band-aid. To achieve (A) we would need a central controller
of priority configuration handling. (C) would look like this:

Can we take an approach similar to sync_standby_defined for
sync_standby_priority? That is, the checkpointer becomes responsible
for changing sync_standby_priority of all walsenders on SIGHUP. That
way, all walsenders see a consistent view of sync_standby_priority.
And when a walsender starts, it sets its sync_standby_priority by
itself. The logic to decide who is a sync standby doesn't change:
SyncRepGetSyncRecPtr() gets all walsenders having higher priority
along with their WAL positions.

Yeah, it works if we do that, but the problem with that approach is
that to determine the priority of walsenders, we need to know which
walsenders are running. That is, when a new walsender arrives (or one
leaves), the process needs to be aware of it right away and reassign
the priorities of all walsenders again.

I think we don't need to reassign the priorities when a walsender
arrives or leaves. IIUC the priority is calculated based only on
synchronous_standby_names; a walsender arriving or leaving doesn't
affect the others' priorities.

Sorry, "priority" in this area is a bit confusing. The "priority"
defined by synchronous_standby_names is determined in isolation from
the presence of walsenders. The "priority" in
walsnd->sync_standby_priority requires knowing which walsenders are
present in order to determine. I meant the latter in this discussion.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kyotaro Horiguchi (#15)
Re: Race condition in SyncRepGetSyncStandbysPriority

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

At Tue, 14 Apr 2020 16:32:40 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
+ stby->is_sync_standby = true; /* might change below */

I'm uneasy with that. In quorum mode all running standbys are marked
as "sync" and that's bogus.

I don't follow that? The existing coding of SyncRepGetSyncStandbysQuorum
returns all the candidates in its list, so this is isomorphic to that.

Possibly a different name for the flag would be better suited?

On the other hand, sync_standbys is already sorted in priority order,
so I think we can get rid of that member by setting *am_sync as follows:

    /* in SyncRepGetSyncRecPtr: */
    if (sync_standbys[i].is_me)
    {
        *am_sync = (i < SyncRepConfig->num_sync);
        break;
    }

I disagree with this; it would change the behavior in the quorum case.

In any case, a change like this will cause callers to know way more than
they ought to about the ordering of the array. In my mind, the fact that
SyncRepGetSyncStandbysPriority is sorting the array is an internal
implementation detail; I do not want it to be part of the API.

(Apropos to that, I realized from working on this patch that there's
another, completely undocumented assumption in the existing code, that
the integer list will be sorted by walsender index for equal priorities.
I don't like that either, and not just because it's undocumented.)

regards, tom lane
