Review for GetWALAvailability()

Started by Fujii Masao · almost 6 years ago · 38 messages · pgsql-hackers
#1 Fujii Masao
masao.fujii@gmail.com

Hi,

The document explains that "lost" value that
pg_replication_slots.wal_status reports means

some WAL files are definitely lost and this slot cannot be used to resume replication anymore.

However, I observed "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.

wal_status is calculated in GetWALAvailability(), and I probably found
some issues in it.

keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
                          wal_segment_size) + 1;

max_wal_size_mb is the number of megabytes. wal_keep_segments is
the number of WAL segment files. So it's strange to calculate max of them.
The above should be the following?

Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size), wal_keep_segments) + 1

if ((max_slot_wal_keep_size_mb <= 0 ||
max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
oldestSegMaxWalSize <= targetSeg)
return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size.
Why is this condition necessary? The document explains "normal
means that the claimed files are within max_wal_size". So whatever
max_slot_wal_keep_size value is, IMO that "normal" should be
reported if the WAL files claimed by the slot are within max_wal_size.
Thought?

Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.

If the WAL files claimed by the slot exceed max_slot_wal_keep_size
but none of those WAL files have been removed yet, wal_status seems
to report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when
some claimed files have been removed, I think. Thought?

Or this behavior is expected and the document is incorrect?

BTW, if we want to implement GetWALAvailability() as the document
advertises, we can simplify it like the attached POC patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

simple_getwalavailability_wip.patch (text/plain; +9, −42)
#2 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#1)
Re: Review for GetWALAvailability()

At Sat, 13 Jun 2020 01:38:49 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

Hi,

The document explains that "lost" value that
pg_replication_slots.wal_status reports means

some WAL files are definitely lost and this slot cannot be used to
resume replication anymore.

However, I observed "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.

wal_status is calculated in GetWALAvailability(), and probably I found
some issues in it.

keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
                          wal_segment_size) + 1;

max_wal_size_mb is the number of megabytes. wal_keep_segments is
the number of WAL segment files. So it's strange to calculate max of
them.

Oops! I don't want to believe I did that but it's definitely wrong.

The above should be the following?

Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
wal_keep_segments) + 1

Looks reasonable.

if ((max_slot_wal_keep_size_mb <= 0 ||
max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
oldestSegMaxWalSize <= targetSeg)
return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size.
Why is this condition necessary? The document explains "normal
means that the claimed files are within max_wal_size". So whatever
max_slot_wal_keep_size value is, IMO that "normal" should be
reported if the WAL files claimed by the slot are within max_wal_size.
Thought?

It was kind of hard to decide. Even when max_slot_wal_keep_size is
smaller than max_wal_size, the segments beyond
max_slot_wal_keep_size are not guaranteed to be kept. In that case
the state transitions as NORMAL->LOST, skipping the "RESERVED" state.
Putting aside whether the setting is useful or not, I thought that the
state transition is somewhat abrupt.

Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.

Does the following make sense?

https://www.postgresql.org/docs/13/view-pg-replication-slots.html

normal means that the claimed files are within max_wal_size.
+ If max_slot_wal_keep_size is smaller than max_wal_size, this state
+ will not appear.

If the WAL files claimed by the slot exceed max_slot_wal_keep_size
but none of those WAL files have been removed yet, wal_status seems
to report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when
some claimed files have been removed, I think. Thought?

Or this behavior is expected and the document is incorrect?

In short, it is known behavior but it was judged as useless to prevent
that.

That can happen when checkpointer removes up to the segment that is
being read by walsender. I think that doesn't happen (or happens
within a narrow time window?) for physical replication but
happens for logical replication.

During development, I once added code to walsender to exit for that
reason, but finally it was moved into InvalidateObsoleteReplicationSlots
as a somewhat different function. With the current mechanism, there's a
case where a once-invalidated slot can revive, but we decided to
allow that behavior and forgot to document it.

Anyway, if you see "lost", something bad is happening.

- lost means that some WAL files are definitely lost and this slot
- cannot be used to resume replication anymore.
+ lost means that some required WAL files have been removed and this slot
+ is no longer usable once it has been disconnected during this state.

If it is crucial that the "lost" state may come back to reserved or
normal state,

+ Note that there are cases where the state moves back to reserved or
+ normal state when all wal senders have left the just-removed segment
+ before being terminated.

There is a case where the state moves back to the reserved or normal state when wal senders leave the just-removed segment before being terminated.

BTW, if we want to implement GetWALAvailability() as the document
advertises, we can simplify it like the attached POC patch.

I'm not sure it is right that the patch removes wal_keep_segments from
the function.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

fix_GetWalAvailability.patch (text/x-patch; +14, −7)
#3 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: Review for GetWALAvailability()

At Mon, 15 Jun 2020 13:42:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Oops! I don't want to believe I did that but it's definitely wrong.

Hmm. Quite disappointing. The patch was just crap.
This is the right patch.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

fix_GetWalAvailability.patch (text/x-patch; +14, −7)
#4 Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: Review for GetWALAvailability()

On 2020/06/15 13:42, Kyotaro Horiguchi wrote:

At Sat, 13 Jun 2020 01:38:49 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

Hi,

The document explains that "lost" value that
pg_replication_slots.wal_status reports means

some WAL files are definitely lost and this slot cannot be used to
resume replication anymore.

However, I observed "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.

wal_status is calculated in GetWALAvailability(), and I probably found
some issues in it.

keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
                          wal_segment_size) + 1;

max_wal_size_mb is the number of megabytes. wal_keep_segments is
the number of WAL segment files. So it's strange to calculate max of
them.

Oops! I don't want to believe I did that but it's definitely wrong.

The above should be the following?

Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
wal_keep_segments) + 1

Looks reasonable.

if ((max_slot_wal_keep_size_mb <= 0 ||
max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
oldestSegMaxWalSize <= targetSeg)
return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size.
Why is this condition necessary? The document explains "normal
means that the claimed files are within max_wal_size". So whatever
max_slot_wal_keep_size value is, IMO that "normal" should be
reported if the WAL files claimed by the slot are within max_wal_size.
Thought?

It was kind of hard to decide. Even when max_slot_wal_keep_size is
smaller than max_wal_size, the segments beyond
max_slot_wal_keep_size are not guaranteed to be kept. In that case
the state transitions as NORMAL->LOST, skipping the "RESERVED" state.
Putting aside whether the setting is useful or not, I thought that the
state transition is somewhat abrupt.

IMO the direct transition of the state from normal to lost is OK
if each state is clearly defined.

Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.

Does the following make sense?

https://www.postgresql.org/docs/13/view-pg-replication-slots.html

normal means that the claimed files are within max_wal_size.
+ If max_slot_wal_keep_size is smaller than max_wal_size, this state
+ will not appear.

I don't think this change is enough. For example, when max_slot_wal_keep_size
is smaller than max_wal_size and the amount of WAL files claimed by the slot
is smaller than max_slot_wal_keep_size, "reserved" is reported. But that is
inconsistent with the meaning of "reserved" in the docs.

To consider what should be reported in wal_status, could you tell me for what
purpose and how users are expected to use this information?

If the WAL files claimed by the slot exceed max_slot_wal_keep_size
but none of those WAL files have been removed yet, wal_status seems
to report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when
some claimed files have been removed, I think. Thought?

Or this behavior is expected and the document is incorrect?

In short, it is known behavior but it was judged as useless to prevent
that.

That can happen when checkpointer removes up to the segment that is
being read by walsender. I think that doesn't happen (or happens
within a narrow time window?) for physical replication but
happens for logical replication.

During development, I once added code to walsender to exit for that
reason, but finally it was moved into InvalidateObsoleteReplicationSlots
as a somewhat different function. With the current mechanism, there's a
case where a once-invalidated slot can revive, but we decided to
allow that behavior and forgot to document it.

Anyway, if you see "lost", something bad is happening.

- lost means that some WAL files are definitely lost and this slot
- cannot be used to resume replication anymore.
+ lost means that some required WAL files have been removed and this slot
+ is no longer usable once it has been disconnected during this state.

If it is crucial that the "lost" state may come back to reserved or
normal state,

+ Note that there are cases where the state moves back to reserved or
+ normal state when all wal senders have left the just-removed segment
+ before being terminated.

There is a case where the state moves back to the reserved or normal state when wal senders leave the just-removed segment before being terminated.

Even if walsender is terminated during the "lost" state, unless checkpointer
removes the required WAL files, the state can go back to "reserved" after
a new replication connection is established. Is this the same as what you're
explaining above?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#5 Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: Review for GetWALAvailability()

On 2020/06/15 13:42, Kyotaro Horiguchi wrote:

At Sat, 13 Jun 2020 01:38:49 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

Hi,

The document explains that "lost" value that
pg_replication_slots.wal_status reports means

some WAL files are definitely lost and this slot cannot be used to
resume replication anymore.

However, I observed "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.

wal_status is calculated in GetWALAvailability(), and I probably found
some issues in it.

keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
                          wal_segment_size) + 1;

max_wal_size_mb is the number of megabytes. wal_keep_segments is
the number of WAL segment files. So it's strange to calculate max of
them.

Oops! I don't want to believe I did that but it's definitely wrong.

The above should be the following?

Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
wal_keep_segments) + 1

Looks reasonable.

if ((max_slot_wal_keep_size_mb <= 0 ||
max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
oldestSegMaxWalSize <= targetSeg)
return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size.
Why is this condition necessary? The document explains "normal
means that the claimed files are within max_wal_size". So whatever
max_slot_wal_keep_size value is, IMO that "normal" should be
reported if the WAL files claimed by the slot are within max_wal_size.
Thought?

It was kind of hard to decide. Even when max_slot_wal_keep_size is
smaller than max_wal_size, the segments beyond
max_slot_wal_keep_size are not guaranteed to be kept. In that case
the state transitions as NORMAL->LOST, skipping the "RESERVED" state.
Putting aside whether the setting is useful or not, I thought that the
state transition is somewhat abrupt.

Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.

Does the following make sense?

https://www.postgresql.org/docs/13/view-pg-replication-slots.html

normal means that the claimed files are within max_wal_size.
+ If max_slot_wal_keep_size is smaller than max_wal_size, this state
+ will not appear.

If the WAL files claimed by the slot exceed max_slot_wal_keep_size
but none of those WAL files have been removed yet, wal_status seems
to report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when
some claimed files have been removed, I think. Thought?

Or this behavior is expected and the document is incorrect?

In short, it is known behavior but it was judged as useless to prevent
that.

That can happen when checkpointer removes up to the segment that is
being read by walsender. I think that doesn't happen (or happens
within a narrow time window?) for physical replication but
happens for logical replication.

During development, I once added code to walsender to exit for that
reason, but finally it was moved into InvalidateObsoleteReplicationSlots
as a somewhat different function.

BTW, I read the code of InvalidateObsoleteReplicationSlots() and probably
found some issues in it.

1. Each cycle of the "for" loop in InvalidateObsoleteReplicationSlots()
emits the log message "terminating walsender ...". This means that
if it takes more than 10ms for walsender to exit after it's signaled,
the second and subsequent cycles would happen and output the same
log message several times. IMO that log message should be output
only once.

2. InvalidateObsoleteReplicationSlots() uses a loop to scan the replication
slots array and uses the "for" loop in each scan. Also it calls
ReplicationSlotAcquire() for each "for" loop cycle, and
ReplicationSlotAcquire() uses another loop to scan the replication slots
array. I don't think this is good design.

ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
InvalidateObsoleteReplicationSlots() already knows the index of the slot
that we want to find. The attached patch does that. Thought?

3. There is a corner case where the termination of walsender cleans up
the temporary replication slot while InvalidateObsoleteReplicationSlots()
is sleeping on ConditionVariableTimedSleep(). In this case,
ReplicationSlotAcquire() is called in the subsequent cycle of the "for"
loop, cannot find the slot, and then emits an ERROR message. This leads
to the failure of the checkpoint by the checkpointer.

To avoid this case, if SAB_Inquire is specified, ReplicationSlotAcquire()
should return a special value instead of emitting an ERROR even when
it cannot find the slot. Also InvalidateObsoleteReplicationSlots() should
handle that special return value.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

replication_slot_acquire.patch (text/plain; +30, −9)
#6 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#4)
Re: Review for GetWALAvailability()

At Mon, 15 Jun 2020 18:59:49 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

It was kind of hard to decide. Even when max_slot_wal_keep_size is
smaller than max_wal_size, the segments beyond
max_slot_wal_keep_size are not guaranteed to be kept. In that case
the state transitions as NORMAL->LOST, skipping the "RESERVED" state.
Putting aside whether the setting is useful or not, I thought that the
state transition is somewhat abrupt.

IMO the direct transition of the state from normal to lost is OK
if each state is clearly defined.

Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.

Does the following make sense?
https://www.postgresql.org/docs/13/view-pg-replication-slots.html
normal means that the claimed files are within max_wal_size.
+ If max_slot_wal_keep_size is smaller than max_wal_size, this state
+ will not appear.

I don't think this change is enough. For example, when max_slot_wal_keep_size
is smaller than max_wal_size and the amount of WAL files claimed by the slot
is smaller than max_slot_wal_keep_size, "reserved" is reported. But that is
inconsistent with the meaning of "reserved" in the docs.

You're right.

To consider what should be reported in wal_status, could you tell me for what
purpose and how users are expected to use this information?

I saw that "reserved" is the state where slots are working to
retain segments, and "normal" is the state indicating that "WAL
segments are within max_wal_size", which is orthogonal to the notion
of "reserved". So it seems to me useless when the retained WAL
segments cannot exceed max_wal_size.

With longer descriptions they would be:

"reserved under max_wal_size"
"reserved over max_wal_size"
"lost some segments"

Come to think of it, I realized that my trouble was just the
wording. Do the following wordings make sense to you?

"reserved" - retained within max_wal_size
"extended" - retained over max_wal_size
"lost" - lost some segments

With these wordings I can live with "not extended"=>"lost". Of course
more appropriate wordings are welcome.

Even if walsender is terminated during the "lost" state, unless checkpointer
removes the required WAL files, the state can go back to "reserved" after
a new replication connection is established. Is this the same as what you're
explaining above?

GetWALAvailability checks restart_lsn against lastRemovedSegNo, thus
"lost" cannot be seen unless checkpointer has actually removed
the segment at restart_lsn (and restart_lsn has not been invalidated).
However, walsenders are killed before the segments are actually
removed, so there are cases where the physical walreceiver reconnects
before RemoveOldXlogFiles removes all the segments, which are then removed
after reconnection. "lost" can go back to "reserved" in that case.
(A physical walreceiver can connect to a slot with an invalid restart_lsn.)

I noticed another issue. If some required WALs are removed, the
slot will be "invalidated", that is, restart_lsn is set to an invalid
value. As a result we hardly ever see the "lost" state.

It could be "fixed" by remembering the validity of a slot separately
from restart_lsn. Is that worth doing?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#5)
Re: Review for GetWALAvailability()

At Tue, 16 Jun 2020 01:46:21 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

In short, it is known behavior but it was judged as useless to prevent
that. That can happen when checkpointer removes up to the segment that is
being read by walsender. I think that doesn't happen (or happens within
a narrow time window?) for physical replication but happens for logical
replication. During development, I once added code to walsender to exit
for that reason, but finally it was moved into
InvalidateObsoleteReplicationSlots as a somewhat different function.

BTW, I read the code of InvalidateObsoleteReplicationSlots() and probably
found some issues in it.

1. Each cycle of the "for" loop in InvalidateObsoleteReplicationSlots()
emits the log message "terminating walsender ...". This means that
if it takes more than 10ms for walsender to exit after it's signaled,
the second and subsequent cycles would happen and output the same
log message several times. IMO that log message should be output
only once.

Sounds reasonable.

2. InvalidateObsoleteReplicationSlots() uses a loop to scan the replication
slots array and uses the "for" loop in each scan. Also it calls
ReplicationSlotAcquire() for each "for" loop cycle, and
ReplicationSlotAcquire() uses another loop to scan the replication slots
array. I don't think this is good design.

ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
InvalidateObsoleteReplicationSlots() already knows the index of the slot
that we want to find. The attached patch does that. Thought?

The inner loop is expected to run at most several times per
checkpoint, which won't be a serious problem. However, it is better if
we can get rid of that in a reasonable way.

The attached patch changes the behavior for SAB_Block. Before the
patch, it rescans from the first slot for the same name, but with the
patch it just rechecks the same slot. The only caller of the function
with SAB_Block is ReplicationSlotDrop, and I can't come up with a case
where another slot with the same name is created at a different place
before the condition variable fires. But I'm not sure the change is
completely safe. Maybe some assertion is needed?

3. There is a corner case where the termination of walsender cleans up
the temporary replication slot while InvalidateObsoleteReplicationSlots()
is sleeping on ConditionVariableTimedSleep(). In this case,
ReplicationSlotAcquire() is called in the subsequent cycle of the "for"
loop, cannot find the slot, and then emits an ERROR message. This leads
to the failure of the checkpoint by the checkpointer.

Agreed.

To avoid this case, if SAB_Inquire is specified, ReplicationSlotAcquire()
should return a special value instead of emitting an ERROR even when
it cannot find the slot. Also InvalidateObsoleteReplicationSlots() should
handle that special return value.

I thought the same thing on hearing that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8 Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#7)
Re: Review for GetWALAvailability()

On 2020/06/16 14:00, Kyotaro Horiguchi wrote:

At Tue, 16 Jun 2020 01:46:21 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

In short, it is known behavior but it was judged as useless to prevent
that. That can happen when checkpointer removes up to the segment that is
being read by walsender. I think that doesn't happen (or happens within
a narrow time window?) for physical replication but happens for logical
replication. During development, I once added code to walsender to exit
for that reason, but finally it was moved into
InvalidateObsoleteReplicationSlots as a somewhat different function.

BTW, I read the code of InvalidateObsoleteReplicationSlots() and probably
found some issues in it.

1. Each cycle of the "for" loop in InvalidateObsoleteReplicationSlots()
emits the log message "terminating walsender ...". This means that
if it takes more than 10ms for walsender to exit after it's signaled,
the second and subsequent cycles would happen and output the same
log message several times. IMO that log message should be output
only once.

Sounds reasonable.

2. InvalidateObsoleteReplicationSlots() uses a loop to scan the replication
slots array and uses the "for" loop in each scan. Also it calls
ReplicationSlotAcquire() for each "for" loop cycle, and
ReplicationSlotAcquire() uses another loop to scan the replication slots
array. I don't think this is good design.

ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
InvalidateObsoleteReplicationSlots() already knows the index of the slot
that we want to find. The attached patch does that. Thought?

The inner loop is expected to run at most several times per
checkpoint, which won't be a serious problem. However, it is better if
we can get rid of that in a reasonable way.

The attached patch changes the behavior for SAB_Block. Before the
patch, it rescans from the first slot for the same name, but with the
patch it just rechecks the same slot. The only caller of the function
with SAB_Block is ReplicationSlotDrop, and I can't come up with a case
where another slot with the same name is created at a different place
before the condition variable fires. But I'm not sure the change is
completely safe.

Yes, that change might not be safe. So I'm thinking of another approach
to fix the issues.

Maybe some assertion is needed?

3. There is a corner case where the termination of walsender cleans up
the temporary replication slot while InvalidateObsoleteReplicationSlots()
is sleeping on ConditionVariableTimedSleep(). In this case,
ReplicationSlotAcquire() is called in the subsequent cycle of the "for"
loop, cannot find the slot, and then emits an ERROR message. This leads
to the failure of the checkpoint by the checkpointer.

Agreed.

To avoid this case, if SAB_Inquire is specified, ReplicationSlotAcquire()
should return a special value instead of emitting an ERROR even when
it cannot find the slot. Also InvalidateObsoleteReplicationSlots() should
handle that special return value.

I thought the same thing hearing that.

While reading InvalidateObsoleteReplicationSlots() code, I found another issue.

ereport(LOG,
(errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
wspid, NameStr(slotname))));
(void) kill(wspid, SIGTERM);

wspid indicates the PID of process using the slot. That process can be
a backend, for example, executing pg_replication_slot_advance().
So "walsender" in the above log message is not always correct.

int wspid = ReplicationSlotAcquire(NameStr(slotname), SAB_Inquire);

Why do we need to call ReplicationSlotAcquire() here and mark the slot as
used by the checkpointer? Isn't it enough to check the slot's
active_pid directly, instead?

Maybe ReplicationSlotAcquire() is necessary because
ReplicationSlotRelease() is called later? If so, why do we need to call
ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's
active_pid is zero. No?

If my understanding is right, I'd like to propose the attached patch.
It introduces DeactivateReplicationSlot() and replaces the "for" loop
in InvalidateObsoleteReplicationSlots() with it. ReplicationSlotAcquire()
and ReplicationSlotRelease() are no longer called there.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

invalidate_obsolete_replication_slots.patch (text/plain; +64, −31)
#9 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#6)
Re: Review for GetWALAvailability()

On 2020-Jun-16, Kyotaro Horiguchi wrote:

I noticed another issue. If some required WALs are removed, the
slot will be "invalidated", that is, restart_lsn is set to an invalid
value. As a result we hardly ever see the "lost" state.

It could be "fixed" by remembering the validity of a slot separately
from restart_lsn. Is that worth doing?

We discussed this before. I agree it would be better to do this
in some way, but I fear that if we do it naively, some code might exist
that reads the LSN without realizing that it needs to check the validity
flag first.

On the other hand, maybe this is not a problem in practice, because if
such a bug occurs, what will happen is that trying to read WAL from such
a slot will return the error message that the WAL file cannot be found.
Maybe this is acceptable?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fujii Masao (#8)
Re: Review for GetWALAvailability()

On 2020-Jun-17, Fujii Masao wrote:

While reading InvalidateObsoleteReplicationSlots() code, I found another issue.

ereport(LOG,
(errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
wspid, NameStr(slotname))));
(void) kill(wspid, SIGTERM);

wspid indicates the PID of process using the slot. That process can be
a backend, for example, executing pg_replication_slot_advance().
So "walsender" in the above log message is not always correct.

Good point.

int wspid = ReplicationSlotAcquire(NameStr(slotname), SAB_Inquire);

Why do we need to call ReplicationSlotAcquire() here and mark the slot as
used by the checkpointer? Isn't it enough to check the slot's
active_pid directly, instead?

Maybe ReplicationSlotAcquire() is necessary because
ReplicationSlotRelease() is called later? If so, why do we need to call
ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's
active_pid is zero. No?

I think the point here was that in order to modify the slot you have to
acquire it -- it's not valid to modify a slot you don't own.

+		/*
+		 * Signal to terminate the process using the replication slot.
+		 *
+		 * Try to signal every 100ms until it succeeds.
+		 */
+		if (!killed && kill(active_pid, SIGTERM) == 0)
+			killed = true;
+		ConditionVariableTimedSleep(&slot->active_cv, 100,
+									WAIT_EVENT_REPLICATION_SLOT_DROP);
+	} while (ReplicationSlotIsActive(slot, NULL));

Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that. On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#8)
Re: Review for GetWALAvailability()

At Wed, 17 Jun 2020 00:46:38 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

2. InvalidateObsoleteReplicationSlots() uses a loop to scan the replication
slots array and uses the "for" loop in each scan. Also it calls
ReplicationSlotAcquire() for each "for" loop cycle, and
ReplicationSlotAcquire() uses another loop to scan the replication slots
array. I don't think this is good design.

ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
InvalidateObsoleteReplicationSlots() already knows the index of the slot
that we want to find. The attached patch does that. Thought?

The inner loop is expected to run at most several times per
checkpoint, which won't be a serious problem. However, it is better if
we can get rid of that in a reasonable way.
The attached patch changes the behavior for SAB_Block. Before the
patch, it rescans from the first slot for the same name, but with the
patch it just rechecks the same slot. The only caller of the function
with SAB_Block is ReplicationSlotDrop and I don't come up with a case
where another slot with the same name is created at different place
before the condition variable fires. But I'm not sure the change is
completely safe.

Yes, that change might not be safe. So I'm thinking another approach
to
fix the issues.

Maybe some assertion is needed?

3. There is a corner case where the termination of walsender cleans up
the temporary replication slot while
InvalidateObsoleteReplicationSlots()
is sleeping on ConditionVariableTimedSleep(). In this case,
ReplicationSlotAcquire() is called in the subsequent cycle of the
"for"
loop, cannot find the slot and then emits ERROR message. This leads
to the failure of checkpoint by the checkpointer.

Agreed.

To avoid this case, if SAB_Inquire is specified,
ReplicationSlotAcquire()
should return the special value instead of emitting ERROR even when
it cannot find the slot. Also InvalidateObsoleteReplicationSlots()
should
handle that special returned value.

I thought the same thing hearing that.

While reading InvalidateObsoleteReplicationSlots() code, I found
another issue.

ereport(LOG,
(errmsg("terminating walsender %d
because replication slot \"%s\" is too
far behind",
wspid,
NameStr(slotname))));
(void) kill(wspid, SIGTERM);

wspid indicates the PID of process using the slot. That process can be
a backend, for example, executing pg_replication_slot_advance().
So "walsender" in the above log message is not always correct.

Agreed.

int wspid = ReplicationSlotAcquire(NameStr(slotname),
SAB_Inquire);

Why do we need to call ReplicationSlotAcquire() here and mark the slot
as
used by the checkpointer? Isn't it enough to check directly the slot's
active_pid, instead?
Maybe ReplicationSlotAcquire() is necessary because
ReplicationSlotRelease() is called later? If so, why do we need to
call
ReplicationSlotRelease()? ISTM that we don't need to do that if the
slot's
active_pid is zero. No?

My understanding of the reason is that we update a slot value here.
The restriction allows the owner of a slot to assume that all the slot
values don't voluntarily change.

slot.h:104
| * - Individual fields are protected by mutex where only the backend owning
| * the slot is authorized to update the fields from its own slot. The
| * backend owning the slot does not need to take this lock when reading its
| * own fields, while concurrent backends not owning this slot should take the
| * lock when reading this slot's data.

If my understanding is right, I'd like to propose the attached patch.
It introduces DeactivateReplicationSlot() and replace the "for" loop
in InvalidateObsoleteReplicationSlots() with
it. ReplicationSlotAcquire()
and ReplicationSlotRelease() are no longer called there.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alvaro Herrera (#9)
Re: Review for GetWALAvailability()

At Tue, 16 Jun 2020 14:31:43 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in

On 2020-Jun-16, Kyotaro Horiguchi wrote:

I noticed the another issue. If some required WALs are removed, the
slot will be "invalidated", that is, restart_lsn is set to invalid
value. As the result we hardly see the "lost" state.

It can be "fixed" by remembering the validity of a slot separately
from restart_lsn. Is that worth doing?

We discussed this before. I agree it would be better to do this
in some way, but I fear that if we do it naively, some code might exist
that reads the LSN without realizing that it needs to check the validity
flag first.

Yes, that was my main concern on it. That's error-prone. How about
remembering the LSN where invalidation happened? It's safe since no
others than slot-monitoring functions would look
last_invalidated_lsn. It can be reset if active_pid is a valid pid.

InvalidateObsoleteReplicationSlots:
...
SpinLockAcquire(&s->mutex);
+ s->data.last_invalidated_lsn = s->data.restart_lsn;
s->data.restart_lsn = InvalidXLogRecPtr;
SpinLockRelease(&s->mutex);

On the other hand, maybe this is not a problem in practice, because if
such a bug occurs, what will happen is that trying to read WAL from such
a slot will return the error message that the WAL file cannot be found.
Maybe this is acceptable?

I'm not sure. For my part a problem of that would we need to look
into server logs to know what is acutally going on.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13Fujii Masao
masao.fujii@gmail.com
In reply to: Alvaro Herrera (#10)
Re: Review for GetWALAvailability()

On 2020/06/17 3:50, Alvaro Herrera wrote:

On 2020-Jun-17, Fujii Masao wrote:

While reading InvalidateObsoleteReplicationSlots() code, I found another issue.

ereport(LOG,
(errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
wspid, NameStr(slotname))));
(void) kill(wspid, SIGTERM);

wspid indicates the PID of process using the slot. That process can be
a backend, for example, executing pg_replication_slot_advance().
So "walsender" in the above log message is not always correct.

Good point.

So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we want, we should add the note about this
case into the docs? Otherwise the users would be surprised at termination
of backends by max_slot_wal_keep_size. I guess that it's basically rarely
happen, though.

int wspid = ReplicationSlotAcquire(NameStr(slotname),
SAB_Inquire);

Why do we need to call ReplicationSlotAcquire() here and mark the slot as
used by the checkpointer? Isn't it enough to check directly the slot's
active_pid, instead?

Maybe ReplicationSlotAcquire() is necessary because
ReplicationSlotRelease() is called later? If so, why do we need to call
ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's
active_pid is zero. No?

I think the point here was that in order to modify the slot you have to
acquire it -- it's not valid to modify a slot you don't own.

Understood. Thanks!

+		/*
+		 * Signal to terminate the process using the replication slot.
+		 *
+		 * Try to signal every 100ms until it succeeds.
+		 */
+		if (!killed && kill(active_pid, SIGTERM) == 0)
+			killed = true;
+		ConditionVariableTimedSleep(&slot->active_cv, 100,
+									WAIT_EVENT_REPLICATION_SLOT_DROP);
+	} while (ReplicationSlotIsActive(slot, NULL));

Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that. On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.

You mean; in this code path, signaling fails only when the target process
disappears just before signaling. So if it fails, slot->active_pid is
expected to become 0 even without signaling more. Right?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#14Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fujii Masao (#13)
Re: Review for GetWALAvailability()

On 2020-Jun-17, Fujii Masao wrote:

On 2020/06/17 3:50, Alvaro Herrera wrote:

So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we want, we should add the note about this
case into the docs? Otherwise the users would be surprised at termination
of backends by max_slot_wal_keep_size. I guess that it's basically rarely
happen, though.

Well, if we could distinguish a walsender from a non-walsender process,
then maybe it would make sense to leave backends alive. But do we want
that? I admit I don't know what would be the reason to have a
non-walsender process with an active slot, so I don't have a good
opinion on what to do in this case.

+		/*
+		 * Signal to terminate the process using the replication slot.
+		 *
+		 * Try to signal every 100ms until it succeeds.
+		 */
+		if (!killed && kill(active_pid, SIGTERM) == 0)
+			killed = true;
+		ConditionVariableTimedSleep(&slot->active_cv, 100,
+									WAIT_EVENT_REPLICATION_SLOT_DROP);
+	} while (ReplicationSlotIsActive(slot, NULL));

Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that. On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.

You mean; in this code path, signaling fails only when the target process
disappears just before signaling. So if it fails, slot->active_pid is
expected to become 0 even without signaling more. Right?

I guess kill() can also fail if the PID now belongs to a process owned
by a different user. I think we've disregarded very quick reuse of
PIDs, so we needn't concern ourselves with it.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alvaro Herrera (#14)
Re: Review for GetWALAvailability()

At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in

On 2020-Jun-17, Fujii Masao wrote:

On 2020/06/17 3:50, Alvaro Herrera wrote:

So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we want, we should add the note about this
case into the docs? Otherwise the users would be surprised at termination
of backends by max_slot_wal_keep_size. I guess that it's basically rarely
happen, though.

Well, if we could distinguish a walsender from a non-walsender process,
then maybe it would make sense to leave backends alive. But do we want
that? I admit I don't know what would be the reason to have a
non-walsender process with an active slot, so I don't have a good
opinion on what to do in this case.

The non-walsender backend is actually doing replication work. It
rather should be killed?

+		/*
+		 * Signal to terminate the process using the replication slot.
+		 *
+		 * Try to signal every 100ms until it succeeds.
+		 */
+		if (!killed && kill(active_pid, SIGTERM) == 0)
+			killed = true;
+		ConditionVariableTimedSleep(&slot->active_cv, 100,
+									WAIT_EVENT_REPLICATION_SLOT_DROP);
+	} while (ReplicationSlotIsActive(slot, NULL));

Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that. On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.

You mean; in this code path, signaling fails only when the target process
disappears just before signaling. So if it fails, slot->active_pid is
expected to become 0 even without signaling more. Right?

I guess kill() can also fail if the PID now belongs to a process owned
by a different user. I think we've disregarded very quick reuse of
PIDs, so we needn't concern ourselves with it.

The first time call to ConditionVariableTimedSleep doen't actually
sleep, so the loop works as expected. But we may make an extra call
to kill(2). Calling ConditionVariablePrepareToSleep beforehand of the
loop would make it better.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#16Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#12)
Re: Review for GetWALAvailability()

At Wed, 17 Jun 2020 10:17:07 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Tue, 16 Jun 2020 14:31:43 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in

On 2020-Jun-16, Kyotaro Horiguchi wrote:

I noticed the another issue. If some required WALs are removed, the
slot will be "invalidated", that is, restart_lsn is set to invalid
value. As the result we hardly see the "lost" state.

It can be "fixed" by remembering the validity of a slot separately
from restart_lsn. Is that worth doing?

We discussed this before. I agree it would be better to do this
in some way, but I fear that if we do it naively, some code might exist
that reads the LSN without realizing that it needs to check the validity
flag first.

Yes, that was my main concern on it. That's error-prone. How about
remembering the LSN where invalidation happened? It's safe since no
others than slot-monitoring functions would look
last_invalidated_lsn. It can be reset if active_pid is a valid pid.

InvalidateObsoleteReplicationSlots:
...
SpinLockAcquire(&s->mutex);
+ s->data.last_invalidated_lsn = s->data.restart_lsn;
s->data.restart_lsn = InvalidXLogRecPtr;
SpinLockRelease(&s->mutex);

The attached does that (Poc). No document fix included.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

GetWalAvailability_change_statuses_fix_lost.patchtext/x-patch; charset=us-asciiDownload+66-28
#17Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#15)
Re: Review for GetWALAvailability()

On 2020/06/17 12:10, Kyotaro Horiguchi wrote:

At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in

On 2020-Jun-17, Fujii Masao wrote:

On 2020/06/17 3:50, Alvaro Herrera wrote:

So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we want, we should add the note about this
case into the docs? Otherwise the users would be surprised at termination
of backends by max_slot_wal_keep_size. I guess that it's basically rarely
happen, though.

Well, if we could distinguish a walsender from a non-walsender process,
then maybe it would make sense to leave backends alive. But do we want
that? I admit I don't know what would be the reason to have a
non-walsender process with an active slot, so I don't have a good
opinion on what to do in this case.

The non-walsender backend is actually doing replication work. It
rather should be killed?

I have no better opinion about this. So I agree to leave the logic as it is
at least for now, i.e., we terminate the process owning the slot whatever
the type of process is.

+		/*
+		 * Signal to terminate the process using the replication slot.
+		 *
+		 * Try to signal every 100ms until it succeeds.
+		 */
+		if (!killed && kill(active_pid, SIGTERM) == 0)
+			killed = true;
+		ConditionVariableTimedSleep(&slot->active_cv, 100,
+									WAIT_EVENT_REPLICATION_SLOT_DROP);
+	} while (ReplicationSlotIsActive(slot, NULL));

Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that. On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.

You mean; in this code path, signaling fails only when the target process
disappears just before signaling. So if it fails, slot->active_pid is
expected to become 0 even without signaling more. Right?

I guess kill() can also fail if the PID now belongs to a process owned
by a different user.

Yes. This case means that the PostgreSQL process using the slot disappeared
and the same PID was assigned to non-PostgreSQL process. So if kill() fails
for this reason, we don't need to kill() again.

I think we've disregarded very quick reuse of

PIDs, so we needn't concern ourselves with it.

The first time call to ConditionVariableTimedSleep doen't actually
sleep, so the loop works as expected. But we may make an extra call
to kill(2). Calling ConditionVariablePrepareToSleep beforehand of the
loop would make it better.

Sorry I failed to understand your point...

Anyway, the attached is the updated version of the patch. This fixes
all the issues in InvalidateObsoleteReplicationSlots() that I reported
upthread.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

invalidate_obsolete_replication_slots_v2.patchtext/plain; charset=UTF-8; name=invalidate_obsolete_replication_slots_v2.patch; x-mac-creator=0; x-mac-type=0Download+126-70
#18Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#17)
Re: Review for GetWALAvailability()

At Wed, 17 Jun 2020 17:01:11 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/06/17 12:10, Kyotaro Horiguchi wrote:

At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote in

On 2020-Jun-17, Fujii Masao wrote:

On 2020/06/17 3:50, Alvaro Herrera wrote:

So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we want, we should add the note about
this
case into the docs? Otherwise the users would be surprised at
termination
of backends by max_slot_wal_keep_size. I guess that it's basically
rarely
happen, though.

Well, if we could distinguish a walsender from a non-walsender
process,
then maybe it would make sense to leave backends alive. But do we
want
that? I admit I don't know what would be the reason to have a
non-walsender process with an active slot, so I don't have a good
opinion on what to do in this case.

The non-walsender backend is actually doing replication work. It
rather should be killed?

I have no better opinion about this. So I agree to leave the logic as
it is
at least for now, i.e., we terminate the process owning the slot
whatever
the type of process is.

Agreed.

+		/*
+		 * Signal to terminate the process using the replication slot.
+		 *
+		 * Try to signal every 100ms until it succeeds.
+		 */
+		if (!killed && kill(active_pid, SIGTERM) == 0)
+			killed = true;
+		ConditionVariableTimedSleep(&slot->active_cv, 100,
+									WAIT_EVENT_REPLICATION_SLOT_DROP);
+	} while (ReplicationSlotIsActive(slot, NULL));

Note that here you're signalling only once and then sleeping many
times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that. On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.

You mean; in this code path, signaling fails only when the target
process
disappears just before signaling. So if it fails, slot->active_pid is
expected to become 0 even without signaling more. Right?

I guess kill() can also fail if the PID now belongs to a process owned
by a different user.

Yes. This case means that the PostgreSQL process using the slot
disappeared
and the same PID was assigned to non-PostgreSQL process. So if kill()
fails
for this reason, we don't need to kill() again.

I think we've disregarded very quick reuse of

PIDs, so we needn't concern ourselves with it.

The first time call to ConditionVariableTimedSleep doen't actually
sleep, so the loop works as expected. But we may make an extra call
to kill(2). Calling ConditionVariablePrepareToSleep beforehand of the
loop would make it better.

Sorry I failed to understand your point...

My point is the ConditionVariableTimedSleep does *not* sleep on the CV
first time in this usage. The new version anyway avoids useless
kill(2) call, but still may make an extra call to
ReplicationSlotAcquireInternal. I think we should call
ConditionVariablePrepareToSleep before the sorrounding for statement
block.

Anyway, the attached is the updated version of the patch. This fixes
all the issues in InvalidateObsoleteReplicationSlots() that I reported
upthread.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#19Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#18)
Re: Review for GetWALAvailability()

On 2020/06/17 17:30, Kyotaro Horiguchi wrote:

At Wed, 17 Jun 2020 17:01:11 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/06/17 12:10, Kyotaro Horiguchi wrote:

At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote in

On 2020-Jun-17, Fujii Masao wrote:

On 2020/06/17 3:50, Alvaro Herrera wrote:

So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we want, we should add the note about
this
case into the docs? Otherwise the users would be surprised at
termination
of backends by max_slot_wal_keep_size. I guess that it's basically
rarely
happen, though.

Well, if we could distinguish a walsender from a non-walsender
process,
then maybe it would make sense to leave backends alive. But do we
want
that? I admit I don't know what would be the reason to have a
non-walsender process with an active slot, so I don't have a good
opinion on what to do in this case.

The non-walsender backend is actually doing replication work. It
rather should be killed?

I have no better opinion about this. So I agree to leave the logic as
it is
at least for now, i.e., we terminate the process owning the slot
whatever
the type of process is.

Agreed.

+		/*
+		 * Signal to terminate the process using the replication slot.
+		 *
+		 * Try to signal every 100ms until it succeeds.
+		 */
+		if (!killed && kill(active_pid, SIGTERM) == 0)
+			killed = true;
+		ConditionVariableTimedSleep(&slot->active_cv, 100,
+									WAIT_EVENT_REPLICATION_SLOT_DROP);
+	} while (ReplicationSlotIsActive(slot, NULL));

Note that here you're signalling only once and then sleeping many
times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that. On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.

You mean; in this code path, signaling fails only when the target
process
disappears just before signaling. So if it fails, slot->active_pid is
expected to become 0 even without signaling more. Right?

I guess kill() can also fail if the PID now belongs to a process owned
by a different user.

Yes. This case means that the PostgreSQL process using the slot
disappeared
and the same PID was assigned to non-PostgreSQL process. So if kill()
fails
for this reason, we don't need to kill() again.

I think we've disregarded very quick reuse of

PIDs, so we needn't concern ourselves with it.

The first time call to ConditionVariableTimedSleep doen't actually
sleep, so the loop works as expected. But we may make an extra call
to kill(2). Calling ConditionVariablePrepareToSleep beforehand of the
loop would make it better.

Sorry I failed to understand your point...

My point is the ConditionVariableTimedSleep does *not* sleep on the CV
first time in this usage. The new version anyway avoids useless
kill(2) call, but still may make an extra call to
ReplicationSlotAcquireInternal. I think we should call
ConditionVariablePrepareToSleep before the sorrounding for statement
block.

OK, so what about the attached patch? I added ConditionVariablePrepareToSleep()
just before entering the "for" loop in InvalidateObsoleteReplicationSlots().

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

invalidate_obsolete_replication_slots_v3.patchtext/plain; charset=UTF-8; name=invalidate_obsolete_replication_slots_v3.patch; x-mac-creator=0; x-mac-type=0Download+143-71
#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fujii Masao (#19)
Re: Review for GetWALAvailability()

I think passing the slot name when the slot is also passed is useless
and wasteful; it'd be better to pass NULL for the name and ignore the
strcmp() in that case -- in fact I suggest to forbid passing both name
and slot. (Any failure there would risk raising an error during
checkpoint, which is undesirable.)

So I propose the following tweaks to your patch, and otherwise +1.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

tweak.patchtext/x-diff; charset=us-asciiDownload+10-10
#21Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#16)
#22Fujii Masao
masao.fujii@gmail.com
In reply to: Alvaro Herrera (#20)
#23Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#19)
#24Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#23)
#25Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#24)
#26Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#23)
#27Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#26)
#28Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#27)
#29Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#16)
#30Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alvaro Herrera (#29)
#31Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#6)
#32Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alvaro Herrera (#31)
#33Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#32)
#34Fujii Masao
masao.fujii@gmail.com
In reply to: Alvaro Herrera (#33)
#35Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fujii Masao (#34)
#36Fujii Masao
masao.fujii@gmail.com
In reply to: Alvaro Herrera (#35)
#37Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#36)
#38Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#37)