Unexpected behavior when setting "idle_replication_slot_timeout"
Hi all,
I am exploring the new setting "idle_replication_slot_timeout" in Postgres
18; for testing purposes, I set the value to "30s", which, unexpectedly to
me, didn't cause an idle slot to be invalidated when I triggered a
checkpoint after the timeout had been reached.
The docs of the option state that the value is rounded up or down to the
nearest full minute, so I reckon "30s" gets rounded down to 0, thus
effectively disabling the feature. It might be less surprising to users if
values between "1s" and "59s" get actually always rounded up to one minute?
Arguably, that'd seem the more intuitive behavior to me. Alternatively,
logging a warning might be considered for values between "1s" and "30s"?
Curious what folks here think.
Thanks and all the best,
--Gunnar
On Fri, Jul 4, 2025 at 4:35 PM Gunnar Morling
<gunnar.morling@googlemail.com> wrote:
Hi all,
I am exploring the new setting "idle_replication_slot_timeout" in Postgres 18; for testing purposes, I set the value to "30s", which, unexpectedly to me, didn't cause an idle slot to be invalidated when I triggered a checkpoint after the timeout had been reached.
The docs of the option state that the value is rounded up or down to the nearest full minute, so I reckon "30s" gets rounded down to 0, thus effectively disabling the feature. It might be less surprising to users if values between "1s" and "59s" get actually always rounded up to one minute? Arguably, that'd seem the more intuitive behavior to me. Alternatively, logging a warning might be considered for values between "1s" and "30s"? Curious what folks here think.
Thanks for bringing this up!
Yes, this is expected behavior, idle_replication_slot_timeout accepts
values in minutes, so a setting like "30s" is rounded down to 0,
effectively disabling the timeout, while values >= "31s" are rounded
up to 1.
This behavior isn’t specific to this GUC as Postgres generally rounds
values below a parameter’s minimum unit without a warning. For
example, wal_summary_keep_time and log_rotation_age behave the same
way.
--
Thanks,
Nisha
On 2025/07/04 21:14, Nisha Moond wrote:
On Fri, Jul 4, 2025 at 4:35 PM Gunnar Morling
<gunnar.morling@googlemail.com> wrote:Hi all,
I am exploring the new setting "idle_replication_slot_timeout" in Postgres 18; for testing purposes, I set the value to "30s", which, unexpectedly to me, didn't cause an idle slot to be invalidated when I triggered a checkpoint after the timeout had been reached.
The docs of the option state that the value is rounded up or down to the nearest full minute, so I reckon "30s" gets rounded down to 0, thus effectively disabling the feature.
When I first tried using idle_replication_slot_timeout, I also encountered this issue.
It might be less surprising to users if values between "1s" and "59s" get actually always rounded up to one minute? Arguably, that'd seem the more intuitive behavior to me. Alternatively, logging a warning might be considered for values between "1s" and "30s"? Curious what folks here think.
Thanks for bringing this up!
Yes, this is expected behavior, idle_replication_slot_timeout accepts
values in minutes, so a setting like "30s" is rounded down to 0,
effectively disabling the timeout, while values >= "31s" are rounded
up to 1.This behavior isn’t specific to this GUC as Postgres generally rounds
values below a parameter’s minimum unit without a warning. For
example, wal_summary_keep_time and log_rotation_age behave the same
way.
Right.
But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
Since at least two users (including myself) tried to set it to a value
less than 1 minute, it might worth considering changing the unit to seconds
(GUC_UNIT_S). Also which would reduces the chance of the reported trouble.
Regards,
--
Fujii Masao
NTT DATA Japan Corporation
On 2025/07/04 23:12, Fujii Masao wrote:
But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
Since at least two users (including myself) tried to set it to a value
less than 1 minute, it might worth considering changing the unit to seconds
(GUC_UNIT_S). Also which would reduces the chance of the reported trouble.
Attached patch changes unit of idle_replication_slot_timeout to seconds.
Regards,
--
Fujii Masao
NTT DATA Japan Corporation
Attachments:
v1-0001-Change-unit-of-idle_replication_slot_timeout-to-s.patchtext/plain; charset=UTF-8; name=v1-0001-Change-unit-of-idle_replication_slot_timeout-to-s.patchDownload+12-16
On Sat, 2025-07-05 at 00:22 +0900, Fujii Masao wrote:
On 2025/07/04 23:12, Fujii Masao wrote:
But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
Since at least two users (including myself) tried to set it to a value
less than 1 minute, it might worth considering changing the unit to seconds
(GUC_UNIT_S). Also which would reduces the chance of the reported trouble.Attached patch changes unit of idle_replication_slot_timeout to seconds.
-1
I think that the reason that several users tried to set it it less than a minute
is that they were trying to test the feature and didn't want to wait long.
I cannot imagine that anybody will want to abandon a standby server just
because it is idle for more than 30 seconds.
For me, there would be two appealing alternatives:
1. Always round up to the next minute.
2. Use the value -1 to deactivate the feature.
Optionally, we could forbid the value 0 in this case.
Stepping back a bit, I am not happy with the documentation. I looked at it,
trying to figure out what the parameter does, and I am none the wiser.
What exactly does it mean for a replication slot to idle?
- Does it mean that the standby is not connected?
- Does it mean that the standby is connected, but gives no feedback?
Probably not, but I only guess that because I know that there is a different
parameter for that.
- Does it mean that the standby is giving feedback, but that feedback doesn't
indicate progress?
I think we could do better here.
Yours,
Laurenz Albe
On 2025/07/05 1:07, Laurenz Albe wrote:
On Sat, 2025-07-05 at 00:22 +0900, Fujii Masao wrote:
On 2025/07/04 23:12, Fujii Masao wrote:
But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
Since at least two users (including myself) tried to set it to a value
less than 1 minute, it might worth considering changing the unit to seconds
(GUC_UNIT_S). Also which would reduces the chance of the reported trouble.Attached patch changes unit of idle_replication_slot_timeout to seconds.
-1
I think that the reason that several users tried to set it it less than a minute
is that they were trying to test the feature and didn't want to wait long.
I cannot imagine that anybody will want to abandon a standby server just
because it is idle for more than 30 seconds.
Maybe. But changing the unit to seconds doesn't make things worse, does it?
It still allows users to set values greater than 1 minute, and also less than
1 minute for debugging or testing purposes, if needed.
Or are you suggesting we should disallow values below 1 minute?
Regards,
--
Fujii Masao
NTT DATA Japan Corporation
Hey all,
Thanks for all the replies!
Indeed I also wouldn't expect anyone to set this timeout to 30s for
purposes other than testing or exploration. But then again, if the option
can be set with second precision, is there any reason to not honor this? To
me, it boils down to the principle of least surprise for users, in
particular when it comes to a setting like "30s" being interpreted as
disabling the feature.
What exactly does it mean for a replication slot to idle?
I'd like to echo this; I also was/am confused by the term "idle" here; it
isn't fully clear to me what it means for a slot to be idle, and in
particular whether it is different from a slot being inactive.
pg_replication_slots uses the "active"/"inactive" terminology, and it seems
that this is what idleness is about, as per the docs of the setting [1]https://www.postgresql.org/docs/18/runtime-config-replication.html#GUC-IDLE-REPLICATION-SLOT-TIMEOUT:
[Users] can force a checkpoint to promptly invalidate inactive slots. The
duration of slot inactivity is calculated using the slot's
pg_replication_slots.inactive_since value.
If so, "inactive_replication_slot_timeout" might be a more consistent name
for that option?
Best,
--Gunnar
[1]: https://www.postgresql.org/docs/18/runtime-config-replication.html#GUC-IDLE-REPLICATION-SLOT-TIMEOUT
https://www.postgresql.org/docs/18/runtime-config-replication.html#GUC-IDLE-REPLICATION-SLOT-TIMEOUT
On Fri, 4 Jul 2025 at 18:24, Fujii Masao <masao.fujii@oss.nttdata.com>
wrote:
Show quoted text
On 2025/07/05 1:07, Laurenz Albe wrote:
On Sat, 2025-07-05 at 00:22 +0900, Fujii Masao wrote:
On 2025/07/04 23:12, Fujii Masao wrote:
But I wonder why the current unit of this GUC is minutes
(GUC_UNIT_MIN).
Since at least two users (including myself) tried to set it to a value
less than 1 minute, it might worth considering changing the unit toseconds
(GUC_UNIT_S). Also which would reduces the chance of the reported
trouble.
Attached patch changes unit of idle_replication_slot_timeout to seconds.
-1
I think that the reason that several users tried to set it it less than
a minute
is that they were trying to test the feature and didn't want to wait
long.
I cannot imagine that anybody will want to abandon a standby server just
because it is idle for more than 30 seconds.Maybe. But changing the unit to seconds doesn't make things worse, does it?
It still allows users to set values greater than 1 minute, and also less
than
1 minute for debugging or testing purposes, if needed.Or are you suggesting we should disallow values below 1 minute?
Regards,
--
Fujii Masao
NTT DATA Japan Corporation
On Sat, 2025-07-05 at 01:24 +0900, Fujii Masao wrote:
On 2025/07/05 1:07, Laurenz Albe wrote:
On Sat, 2025-07-05 at 00:22 +0900, Fujii Masao wrote:
On 2025/07/04 23:12, Fujii Masao wrote:
But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
Since at least two users (including myself) tried to set it to a value
less than 1 minute, it might worth considering changing the unit to seconds
(GUC_UNIT_S). Also which would reduces the chance of the reported trouble.Attached patch changes unit of idle_replication_slot_timeout to seconds.
-1
I think that the reason that several users tried to set it it less than a minute
is that they were trying to test the feature and didn't want to wait long.
I cannot imagine that anybody will want to abandon a standby server just
because it is idle for more than 30 seconds.Maybe. But changing the unit to seconds doesn't make things worse, does it?
It still allows users to set values greater than 1 minute, and also less than
1 minute for debugging or testing purposes, if needed.Or are you suggesting we should disallow values below 1 minute?
I guess you are right. There is no problem with second precision, even if
the use case in this case was artificial.
I withdraw my objection.
Gunnar Morlin wrote:
I also was/am confused by the term "idle" here; it isn't fully clear to me
what it means for a slot to be idle, and in particular whether it is different
from a slot being inactive. [...]If so, "inactive_replication_slot_timeout" might be a more consistent name
for that option?
Perhaps. I must say that I don't care so much about the name, as long as the
documentation doesn't leave any doubts.
Yours,
Laurenz Albe
On Fri, Jul 4, 2025 at 9:54 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2025/07/05 1:07, Laurenz Albe wrote:
On Sat, 2025-07-05 at 00:22 +0900, Fujii Masao wrote:
On 2025/07/04 23:12, Fujii Masao wrote:
But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
Since at least two users (including myself) tried to set it to a value
less than 1 minute, it might worth considering changing the unit to seconds
(GUC_UNIT_S). Also which would reduces the chance of the reported trouble.Attached patch changes unit of idle_replication_slot_timeout to seconds.
-1
I think that the reason that several users tried to set it it less than a minute
is that they were trying to test the feature and didn't want to wait long.
I cannot imagine that anybody will want to abandon a standby server just
because it is idle for more than 30 seconds.Maybe. But changing the unit to seconds doesn't make things worse, does it?
It still allows users to set values greater than 1 minute, and also less than
1 minute for debugging or testing purposes, if needed.
We expect the value of this variable to be in hours or, in some cases,
days. Specifying in seconds would be inconvenient for users. Now, I
agree there is a value in testing/debugging to allow it to be seconds,
but the same could be said for other variables whose units are in
minutes, like log_rotation_age and wal_summary_keep_time.
Or are you suggesting we should disallow values below 1 minute?
We should be consistent with other similar GUC variables whose unit is
in GUC_UNIT_MIN.
--
With Regards,
Amit Kapila.
On Fri, Jul 4, 2025 at 9:37 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
Stepping back a bit, I am not happy with the documentation. I looked at it,
trying to figure out what the parameter does, and I am none the wiser.
What exactly does it mean for a replication slot to idle?- Does it mean that the standby is not connected?
It means the above. The slot is used for purposes other than the
standby as well, so we can't mention something only specific to the
standby. This parameter mainly cares when the slot becomes inactive.
You can get more information by looking at
pg_replication_slots.inactive_since (as indicated by sentence: "The
duration of slot inactivity is calculated using the slot's
pg_replication_slots.inactive_since value.").
I think we could do better here.
Sure, feel free to propose what you think makes it better.
--
With Regards,
Amit Kapila.
On Sat, 2025-07-05 at 09:52 +0530, Amit Kapila wrote:
On Fri, Jul 4, 2025 at 9:37 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
What exactly does it mean for a replication slot to idle?
- Does it mean that the standby is not connected?
It means the above. The slot is used for purposes other than the
standby as well, so we can't mention something only specific to the
standby.I think we could do better here.
Sure, feel free to propose what you think makes it better.
Done in the attached patch.
We expect the value of this variable to be in hours or, in some cases,
days. Specifying in seconds would be inconvenient for users.
I don't buy that argument. Specifying shared_buffers in 8kB blocks
would be quite inconvenient to most users, but I don't remember any
complaints about it. One of the nice things about the Grand Unified
Configuration is that you can use units different from the default.
On the other hand, if the behavior is clearly documented, as I have
tried to do with my patch, it should be fine. So I'll rest my case if
you apply my patch.
Yours,
Laurenz Albe
Attachments:
v1-0001-Improve-doc-for-idle_replication_slot_timeout.patchtext/x-patch; charset=UTF-8; name=v1-0001-Improve-doc-for-idle_replication_slot_timeout.patchDownload+8-7
On Friday, July 4, 2025, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
One of the nice things about the Grand Unified
Configuration is that you can use units different from the default.
Supplying the value as '2 h', SHOW will display 2h while
pg_settings.setting shows 120 (min).
David J.
On Fri, Jul 4, 2025 at 10:35 PM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:
On the other hand, if the behavior is clearly documented, as I have
tried to do with my patch, it should be fine. So I'll rest my case if
you apply my patch.
We should clearly document how rounding works in section 19.1.1 (which we
mostly do; "If the parameter is of integer type, a final rounding to
integer occurs after any unit conversion.") and not in every time-related
setting that chooses to use something larger than microseconds. So, 30s is
'unit converted' up to 0.5 minutes (not explicitly explained...) then
rounded to zero (which is odd, half normally rounds up...). I'm against
cluttering up the individual settings docs with this detail.
If the change from idle to inactive is needed in the description we
should just admit we named it wrong in the first place. As-is, the
description matches the name and the callout to the field in the second
paragraph precisely clears up what this setting at least cares about. The
reader should be directed to how that field is computed should they need
clarification.
Thus, I'd accept but not find required the idle/inactive wording change to
any of various degrees; and would ask that any clarification regarding
generic setting value interpretation be relegated to 19.1.1 where all such
settings can benefit.
David J.
On Fri, 2025-07-04 at 23:16 -0700, David G. Johnston wrote:
We should clearly document how rounding works in section 19.1.1
(which we mostly do; "If the parameter is of integer type, a final rounding
to integer occurs after any unit conversion.") and not in every
time-related setting that chooses to use something larger than microseconds.
So, 30s is 'unit converted' up to 0.5 minutes (not explicitly explained...)
then rounded to zero (which is odd, half normally rounds up...).
I'm against cluttering up the individual settings docs with this detail.
That's fine with me; do you have a patch?
If the change from idle to inactive is needed in the description we should
just admit we named it wrong in the first place.
I had half a mind to propose renaming the parameter, but I shied from
a lengthy bikeshedding discussion. Reading up on the archives, I see
that Peter Smith proposed the term "idle" in [1]/messages/by-id/CAHut+PtHbYNxPvtMfs7jARbsVcFXL1=C9SO3Q93NgVDgbKN7LQ@mail.gmail.com, and nobody had any
problem with it.
For the record: I would be much more happy if the parameter were called
"inactive_replication_slot_timeout", since we use the term "active" in
"pg_replication_slots". Also, we call connections "idle" when they are
established, but doing nothing, and this parameter is about disconnected
replication connections.
As-is, the description
matches the name and the callout to the field in the second paragraph
precisely clears up what this setting at least cares about. The reader
should be directed to how that field is computed should they need clarification.Thus, I'd accept but not find required the idle/inactive wording change to
any of various degrees; and would ask that any clarification regarding
generic setting value interpretation be relegated to 19.1.1 where all
such settings can benefit.
I am sure that there is some information in these sentences, but I cannot
extract it, even after reading them twice.
Yours,
Laurenz Albe
[1]: /messages/by-id/CAHut+PtHbYNxPvtMfs7jARbsVcFXL1=C9SO3Q93NgVDgbKN7LQ@mail.gmail.com
On Saturday, July 5, 2025, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
I am sure that there is some information in these sentences, but I cannot
extract it, even after reading them twice.
Maybe: “During checkpoint if the interval since
pg_replication_slots.inactive_since and now is larger than this value
pg_replication_slots.conflicting is set to true and
pg_replication_slots.inactive_reason is set to ‘timeout’. See section
wherever for more information on handling conflicted slots.”
Heck, writing this, “idle” is probably better, a slot can recover from
being idle on its own but usually inactive would imply having to do
something to make it active again.
IMO our documentation for replication has serious flaws but this particular
area is clear enough. Like any good timeout the slot is killed if it goes
unused “idle” for some length of time. We can describe that in many ways
but the name, to me, is fully descriptive and consistent with other
timeouts like “idle_in_transaction_timeout”.
David J.
On 2025/07/05 13:06, Amit Kapila wrote:
On Fri, Jul 4, 2025 at 9:54 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2025/07/05 1:07, Laurenz Albe wrote:
On Sat, 2025-07-05 at 00:22 +0900, Fujii Masao wrote:
On 2025/07/04 23:12, Fujii Masao wrote:
But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
Since at least two users (including myself) tried to set it to a value
less than 1 minute, it might worth considering changing the unit to seconds
(GUC_UNIT_S). Also which would reduces the chance of the reported trouble.Attached patch changes unit of idle_replication_slot_timeout to seconds.
-1
I think that the reason that several users tried to set it it less than a minute
is that they were trying to test the feature and didn't want to wait long.
I cannot imagine that anybody will want to abandon a standby server just
because it is idle for more than 30 seconds.Maybe. But changing the unit to seconds doesn't make things worse, does it?
It still allows users to set values greater than 1 minute, and also less than
1 minute for debugging or testing purposes, if needed.We expect the value of this variable to be in hours or, in some cases,
days. Specifying in seconds would be inconvenient for users. Now, I
agree there is a value in testing/debugging to allow it to be seconds,
but the same could be said for other variables whose units are in
minutes, like log_rotation_age and wal_summary_keep_time.
Even if we change the unit to seconds (GUC_UNIT_S), it's still possible
to set the timeout values such as hours or days. For example, even with
the patch, we can set the timeout to 365 days:
=# ALTER SYSTEM SET idle_replication_slot_timeout TO '365d';
=# SELECT pg_reload_conf();
=# SHOW idle_replication_slot_timeout ;
idle_replication_slot_timeout
-------------------------------
365d
(1 row)
Do you see any serious downside to switching the unit to seconds? I don't
think it introduces any serious issues. On the contrary, it gives users finer
control over the timeout, and additionally works around the issue
that we're discussing here.
Regards,
--
Fujii Masao
NTT DATA Japan Corporation
On 2025/07/05 15:16, David G. Johnston wrote:
On Fri, Jul 4, 2025 at 10:35 PM Laurenz Albe <laurenz.albe@cybertec.at <mailto:laurenz.albe@cybertec.at>> wrote:
On the other hand, if the behavior is clearly documented, as I have
tried to do with my patch, it should be fine. So I'll rest my case if
you apply my patch.We should clearly document how rounding works in section 19.1.1 (which we mostly do; "If the parameter is of integer type, a final rounding to integer occurs after any unit conversion.") and not in every time-related setting that chooses to use something larger than microseconds. So, 30s is 'unit converted' up to 0.5 minutes (not explicitly explained...) then rounded to zero (which is odd, half normally rounds up...).
This happens because in this case rounding is done using rint(3),
which uses banker's rounding by default. While the rint(3)'s rounding method
can be changed with fesetround(), PostgreSQL doesn't seem to change it.
So 0.5 rounds to 0, 1.5 and 2.5 round to 2, 3.5 and 4.5 round to 4, and so on.
Regards,
--
Fujii Masao
NTT DATA Japan Corporation
On Saturday, July 5, 2025, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Do you see any serious downside to switching the unit to seconds? I don't
think it introduces any serious issues. On the contrary, it gives users
finer
control over the timeout, and additionally works around the issue
that we're discussing here.
I do not, and would rather we make the change. Minutes are an
unconventional base unit for time in our world and should be avoided.
David J.
On Saturday, July 5, 2025, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2025/07/05 15:16, David G. Johnston wrote:
On Fri, Jul 4, 2025 at 10:35 PM Laurenz Albe <laurenz.albe@cybertec.at
<mailto:laurenz.albe@cybertec.at>> wrote:On the other hand, if the behavior is clearly documented, as I have
tried to do with my patch, it should be fine. So I'll rest my case if
you apply my patch.We should clearly document how rounding works in section 19.1.1 (which we
mostly do; "If the parameter is of integer type, a final rounding to
integer occurs after any unit conversion.") and not in every time-related
setting that chooses to use something larger than microseconds. So, 30s is
'unit converted' up to 0.5 minutes (not explicitly explained...) then
rounded to zero (which is odd, half normally rounds up...).This happens because in this case rounding is done using rint(3),
which uses banker's rounding by default. While the rint(3)'s rounding
method
can be changed with fesetround(), PostgreSQL doesn't seem to change it.
So 0.5 rounds to 0, 1.5 and 2.5 round to 2, 3.5 and 4.5 round to 4, and so
on.
Right, a.k.a., half-even rounding. So, not odd.
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes:
On Saturday, July 5, 2025, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Do you see any serious downside to switching the unit to seconds? I don't
think it introduces any serious issues. On the contrary, it gives users
finer
control over the timeout, and additionally works around the issue
that we're discussing here.
I do not, and would rather we make the change. Minutes are an
unconventional base unit for time in our world and should be avoided.
By my count, there are ten GUCs declared with GUC_UNIT_S and three
with GUC_UNIT_MIN. I'd say that there may be some lean towards
seconds but David's argument seems like pure hyperbole.
I'm kind of down on changing the unit, because it will *silently*
break configuration files where the value was set without a unit.
May I suggest an alternative? We could change the variable from int
to float type and continue to specify it in minutes. That will have
exactly zero compatibility impact, it allows sub-minute values to
be selected at need, and it removes the need for hair-splitting
documentation about what the rounding rules are.
We previously did the same with vacuum_cost_delay to avoid worries
about how to specify sub-millisecond precision for that. So the
infrastructure is already in place, I think. The patch will be
different from what is proposed but should need to touch pretty
much the same places.
regards, tom lane