Introduce XID age based replication slot invalidation
Hi folks,
I'd like to restart the discussion about providing an xid-based slot
invalidation mechanism. The previous effort [1] presented an XID and
time-based invalidation and the inactive time-based approach was
implemented first. The latest XID based patch from Bharath Rupireddy
can be found here [2].
When thinking about availability of the database, inactive replication
slots cause two main pain points:
1) WAL accumulation
2) Replication slots with xmin/catalog_xmin can hold back vacuuming
leading to wrap-around
The first issue can be mitigated by 'max_slot_wal_keep_size'. However
in the second case there are no good mechanisms to prioritize write
availability of the database and avoid wraparound. The new GUC
'idle_replication_slot_timeout' partially addresses the concern if you
have similar workloads. However it's hard to set the same setting
across a fleet of different applications.
It's easy to imagine a high-XID churning workload in one cluster while
another has large batch jobs where changes get synced out
periodically. There isn't a one-size-fits-all setting for
'idle_replication_slot_timeout' in these two cases.
The attached patch addresses this by introducing 'max_slot_xid_age' in
a similar fashion. Replication slots whose transaction ID age exceeds
the set value will get invalidated, allowing vacuum to proceed and
biasing towards database availability.
Invalidation happens in CHECKPOINT, similar to
'idle_replication_slot_timeout', and when VACUUM occurs.
The patch currently attempts to invalidate once-per-autovacuum worker.
We're wondering if it should attempt invalidation on a per-relation
basis within the vacuum call itself. That would account for scenarios
where the cost_delay or naptime is high between autovac executions.
Thanks,
John H
[1]: /messages/by-id/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com
[2]: /messages/by-id/CALj2ACXe8+xSNdMXTMaSRWUwX7v61Ad4iddUwnn=djSwx3GLLg@mail.gmail.com
--
John Hsu - Amazon Web Services
Attachments:
0044-Add-XID-age-based-replication-slot-invalidation.patch (+436, -10)
Dear John,
The first issue can be mitigated by 'max_slot_wal_keep_size'. However
in the second case there are no good mechanisms to prioritize write
availability of the database and avoid wraparound. The new GUC
'idle_replication_slot_timeout' partially addresses the concern if you
have similar workloads. However it's hard to set the same setting
across a fleet of different applications.
IIUC, the feature can avoid the wraparound issue more directly than the
other invalidation mechanisms. The motivation seems sufficient to me.
The patch currently attempts to invalidate once-per-autovacuum worker.
We're wondering if it should attempt invalidation on a per-relation
basis within the vacuum call itself. That would account for scenarios
where the cost_delay or naptime is high between autovac executions.
I have a concern that the age calculation acquires XidGenLock, so
performance can be affected. Do you have insights on this?
Invalidation happens in CHECKPOINT, similar to
'idle_replication_slot_timeout', and when VACUUM occurs.
Let me confirm, because I'm new to this. The invalidation can also be
triggered by VACUUM because an old XID can make VACUUM fail, right? The
timeout is aimed at WAL, so it is not so related to VACUUM, which does
not recycle segments.
In contrast, is there a possibility that the XID-age check could be done
only at VACUUM time?
Regarding the patch, try_replication_slot_invalidation() and ReplicationSlotIsXIDAged()
do the same task. Can we reduce duplicated part?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi Hayato,
Thank you for taking a look.
The patch currently attempts to invalidate once-per-autovacuum worker.
We're wondering if it should attempt invalidation on a per-relation
basis within the vacuum call itself. That would account for scenarios
where the cost_delay or naptime is high between autovac executions.
I have a concern that the age calculation acquires XidGenLock, so
performance can be affected. Do you have insights on this?
Are you concerned about the case where we do the check per table, or
about the current situation where it's only done once per worker?
Invalidation happens in CHECKPOINT, similar to
'idle_replication_slot_timeout', and when VACUUM occurs.
Let me confirm, because I'm new to this. The invalidation can also be
triggered by VACUUM because an old XID can make VACUUM fail, right? The
timeout is aimed at WAL, so it is not so related to VACUUM, which does
not recycle segments.
I feel that the timeout is used as a way to roughly address storage
accumulation, or VACUUM not progressing due to slots.
In contrast, is there a possibility that the XID-age check could be done
only at VACUUM time?
It's also done in CHECKPOINT because there can be stale replication
slots on the standby that aren't present on the writer. We would still
want them to be invalidated.
Regarding the patch, try_replication_slot_invalidation() and ReplicationSlotIsXIDAged()
do the same task. Can we reduce duplicated part?
Thanks for catching that; I thought I had done this, but apparently not.
Updated in the latest attachment.
--
John Hsu - Amazon Web Services
Attachments:
0045-Add-XID-age-based-replication-slot-invalidation.patch (+411, -10)
Hi,
On Thu, Sep 18, 2025 at 10:20 AM John H <johnhyvr@gmail.com> wrote:
I'd like to restart the discussion about providing an xid-based slot
invalidation mechanism. The previous effort [1] presented an XID and
time-based invalidation and the inactive time-based approach was
implemented first. The latest XID based patch from Bharath Rupireddy
can be found here [2].
When thinking about availability of the database, inactive replication
slots cause two main pain points:
1) WAL accumulation
2) Replication slots with xmin/catalog_xmin can hold back vacuuming
leading to wrap-around
It's easy to imagine a high-XID churning workload in one cluster while
another has large batch jobs where changes get synced out
periodically. There isn't a "one-size" fits all setting for
'idle_replication_slot_timeout' in these two cases.
+1.
The attached patch addresses this by introducing 'max_slot_xid_age' in
a similar fashion. Replication slots with transaction ID greater than
the set age will get invalidated allowing vacuum to proceed, biasing
towards database availability.
Invalidation happens in CHECKPOINT, similar to
'idle_replication_slot_timeout', and when VACUUM occurs.
The patch currently attempts to invalidate once-per-autovacuum worker.
We're wondering if it should attempt invalidation on a per-relation
basis within the vacuum call itself. That would account for scenarios
where the cost_delay or naptime is high between autovac executions.
IMO, computing XID horizons per-relation during vacuum is good. The
main reason we try to invalidate replication slots based on the XID
age in the vacuum path is to help the database when it needs it most -
when vacuum is computing the XID horizons. That said, it would be good
to have performance analysis with a large number of replication slots,
comparing once-per-relation vs. once-per-autovacuum worker vs.
once-per-autovacuum launcher wake-up cycle.
I haven't looked at the patch in depth, but it would be good to have a
TAP test with more realistic production workloads. We could set this
value to less than 1.5 billion and use the xid_wraparound test to quickly
reach the wraparound limits, then verify if this setting can help
prevent the database from reaching wraparound errors. This approach
would also validate the age calculations in
try_replication_slot_invalidation with higher limits.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi John,
Thank you for sending in the rebased patch earlier. I will have some
cycles going forward and I would like to continue with this work.
Hi Kuroda-san,
Thank you for reviewing the patch.
On Fri, Sep 19, 2025 at 1:07 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
IIUC, the feature can avoid the wraparound issue more directly than the
other invalidation mechanisms. The motivation seems sufficient to me.
That's correct. When enabled, replication slots whose XID age exceeds
the configured value get invalidated before vacuum computes the XID
horizons. This ensures that slots which would otherwise prevent vacuum
from freezing heap tuples don't get in the way of XID wraparound
prevention.
The patch currently attempts to invalidate once-per-autovacuum worker.
We're wondering if it should attempt invalidation on a per-relation
basis within the vacuum call itself. That would account for scenarios
where the cost_delay or naptime is high between autovac executions.
I have a concern that the age calculation acquires XidGenLock, so
performance can be affected. Do you have insights on this?
I made the following design choice: try invalidating only once per
vacuum cycle, not per table. While this keeps the cost of checking
(incl. the XidGenLock contention) for invalidation to a minimum when
there are a large number of tables and replication slots, it can be
less effective when individual tables/indexes are large. Invalidating
during checkpoints can help to some extent with the large table/index
cases. But I'm open to thoughts on this.
Please find the attached patch for further review. I fixed the XID age
calculation in ReplicationSlotIsXIDAged and adjusted the code
comments.
--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Add-XID-age-based-replication-slot-invalidation.patch (+424, -8)
Hi Bharath,
Do you think we need different GUCs for catalog_xmin and xmin? If table
bloat is the concern (not catalog bloat), then logical slots don't need
to be invalidated unless the cluster is close to wraparound.
I made the following design choice: try invalidating only once per
vacuum cycle, not per table. While this keeps the cost of checking
(incl. the XidGenLock contention) for invalidation to a minimum when
there are a large number of tables and replication slots, it can be
less effective when individual tables/indexes are large. Invalidating
during checkpoints can help to some extent with the large table/index
cases. But I'm open to thoughts on this.
It may not achieve the intent when the vacuum cycle is longer, which one
can expect on a large database, particularly when there is heavy bloat.
Please find the attached patch for further review. I fixed the XID age
calculation in ReplicationSlotIsXIDAged and adjusted the code
comments.
I applied the patch and all the tests passed. A few comments:
@@ -495,7 +525,7 @@ vacuum(List *relations, const VacuumParams params, BufferAccessStrategy bstrateg
 			   MemoryContext vac_context, bool isTopLevel)
 {
 	static bool in_vacuum = false;
-
+	static bool first_time = true;
The first_time variable is not self-explanatory; maybe rename it to
something like try_replication_slot_invalidation and add a comment that
it will be set to false after the first check?
+ if (TransactionIdIsValid(xmin))
+     appendStringInfo(&err_detail, _("The slot's xmin %u exceeds the maximum xid age %d specified by \"max_slot_xid_age\"."),
+                      xmin,
+                      max_slot_xid_age);
The slot gets invalidated even when the age equals max_slot_xid_age, doesn't it?
Thanks,
Satya
Hi,
On Fri, Mar 20, 2026 at 11:29 PM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
Do you think we need different GUCs for catalog_xmin and xmin? If table bloat is a concern (not catalog bloat), then logical slots are not required to invalidate unless the cluster is close to wraparound.
IMO the main purpose of max_slot_xid_age is to prevent XID wraparound.
For bloat, I still think max_slot_wal_keep_size is the better choice.
Where max_slot_xid_age is really useful is when the vacuum can't
freeze because a replication slot (physical or logical) is holding
back the XID horizon and the system is getting close to wraparound.
Invalidating such a slot clears the way for vacuum. Setting
max_slot_xid_age above vacuum_failsafe_age would let vacuum waste
cycles scanning tables it cannot freeze. Keeping max_slot_xid_age <=
vacuum_failsafe_age (default 1.6B) prevents this by invalidating the
slot before vacuum effort is wasted.
As far as XID wraparound is concerned, both xmin and catalog_xmin need
to be treated similarly. Either one can hold back freezing and push
the system toward wraparound. So I don't think we need separate GUCs
for xmin and catalog_xmin unless I'm missing something. One GUC
covering both keeps things simple.
I made the following design choice: try invalidating only once per
vacuum cycle, not per table. While this keeps the cost of checking
(incl. the XidGenLock contention) for invalidation to a minimum when
there are a large number of tables and replication slots, it can be
less effective when individual tables/indexes are large. Invalidating
during checkpoints can help to some extent with the large table/index
cases. But I'm open to thoughts on this.
It may not solve the intent when the vacuum cycle is longer, which one can expect on a large database particularly when there is heavy bloat.
This design choice boils down to the following: a database instance
having either 1/ a large number of small tables or 2/ large tables.
From my experience, I have seen both cases but mostly case 2 (others
can correct me). In this context, having an XID age based slot
invalidation check once per relation makes sense. However, I'm open to
more thoughts here.
Please find the attached patch for further review. I fixed the XID age
calculation in ReplicationSlotIsXIDAged and adjusted the code
comments.
I applied the patch and all the tests passed. A few comments:
Thank you for reviewing the patch.
@@ -495,7 +525,7 @@ vacuum(List *relations, const VacuumParams params, BufferAccessStrategy bstrateg
 			   MemoryContext vac_context, bool isTopLevel)
 {
 	static bool in_vacuum = false;
-
+	static bool first_time = true;
The first_time variable is not self-explanatory; maybe rename it to
something like try_replication_slot_invalidation and add a comment that
it will be set to false after the first check?
+1. Changed the variable name and simplified the comments around.
+ if (TransactionIdIsValid(xmin))
+     appendStringInfo(&err_detail, _("The slot's xmin %u exceeds the maximum xid age %d specified by \"max_slot_xid_age\"."),
+                      xmin,
+                      max_slot_xid_age);
The slot gets invalidated even when the age equals max_slot_xid_age, doesn't it?
Nice catch! I changed it to use TransactionIdPrecedes so that it matches
the above error message, like two of the existing XID age GUCs
(autovacuum_freeze_max_age, vacuum_failsafe_age).
Please find the attached v2 patch for further review. Thank you!
--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Add-XID-age-based-replication-slot-invalidation.patch (+432, -8)
Hi,
On Mon, Mar 23, 2026 at 9:00 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Hi,
On Fri, Mar 20, 2026 at 11:29 PM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
Do you think we need different GUCs for catalog_xmin and xmin? If table bloat is a concern (not catalog bloat), then logical slots are not required to invalidate unless the cluster is close to wraparound.
IMO the main purpose of max_slot_xid_age is to prevent XID wraparound.
For bloat, I still think max_slot_wal_keep_size is the better choice.
Where max_slot_xid_age is really useful is when the vacuum can't
freeze because a replication slot (physical or logical) is holding
back the XID horizon and the system is getting close to wraparound.
Invalidating such a slot clears the way for vacuum. Setting
max_slot_xid_age above vacuum_failsafe_age allows vacuum to waste
cycles scanning tables it cannot freeze. Keeping max_slot_xid_age <=
vacuum_failsafe_age (default 1.6B) prevents this by invalidating the
slot before vacuum effort is wasted.
As far as XID wraparound is concerned, both xmin and catalog_xmin need
to be treated similarly. Either one can hold back freezing and push
the system toward wraparound. So I don't think we need separate GUCs
for xmin and catalog_xmin unless I'm missing something. One GUC
covering both keeps things simple.
I've studied the discussion on this thread and the patch. I understand
the purpose of this feature and agree that it's useful especially in
cases where orphaned (physical or logical) replication slots prevent
the xmin from advancing and inactive_since based slot invalidation
might not fit.
And +1 for treating both the slot's xmin and catalog_xmin similarly
with the single GUC.
I made the following design choice: try invalidating only once per
vacuum cycle, not per table. While this keeps the cost of checking
(incl. the XidGenLock contention) for invalidation to a minimum when
there are a large number of tables and replication slots, it can be
less effective when individual tables/indexes are large. Invalidating
during checkpoints can help to some extent with the large table/index
cases. But I'm open to thoughts on this.
It may not solve the intent when the vacuum cycle is longer, which one can expect on a large database particularly when there is heavy bloat.
This design choice boils down to the following: a database instance
having either 1/ a large number of small tables or 2/ large tables.
From my experience, I have seen both cases but mostly case 2 (others
can correct me). In this context, having an XID age based slot
invalidation check once per relation makes sense. However, I'm open to
more thoughts here.
ISTM that checking the XID-based slot invalidation per table would be
more bullet-proof and cover many cases. How about checking the
XID-based slot invalidation opportunity only when the OldestXmin is
older than the new GUC? For example, we can do this check in
heap_vacuum_rel() based on the VacuumCutoffs returned by
vacuum_get_cutoffs(). If we invalidate at least one slot for its XID,
we can re-compute the OldestXmin.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,
On Mon, Mar 23, 2026 at 4:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've studied the discussion on this thread and the patch. I understand
the purpose of this feature and agree that it's useful especially in
cases where orphaned (physical or logical) replication slots prevent
the xmin from advancing and inactive_since based slot invalidation
might not fit.
And +1 for treating both the slot's xmin and catalog_xmin similarly
with the single GUC.
Thanks for reviewing the patch.
I made the following design choice: try invalidating only once per
vacuum cycle, not per table. While this keeps the cost of checking
(incl. the XidGenLock contention) for invalidation to a minimum when
there are a large number of tables and replication slots, it can be
less effective when individual tables/indexes are large. Invalidating
during checkpoints can help to some extent with the large table/index
cases. But I'm open to thoughts on this.
It may not solve the intent when the vacuum cycle is longer, which one can expect on a large database particularly when there is heavy bloat.
This design choice boils down to the following: a database instance
having either 1/ a large number of small tables or 2/ large tables.
From my experience, I have seen both cases but mostly case 2 (others
can correct me). In this context, having an XID age based slot
invalidation check once per relation makes sense. However, I'm open to
more thoughts here.
ISTM that checking the XID-based slot invalidation per table would be
more bullet-proof and cover many cases. How about checking the
XID-based slot invalidation opportunity only when the OldestXmin is
older than the new GUC? For example, we can do this check in
heap_vacuum_rel() based on the VacuumCutoffs returned by
vacuum_get_cutoffs(). If we invalidate at least one slot for its XID,
we can re-compute the OldestXmin.
Agreed. Here's the patch that moves the XID-age based slot
invalidation check to vacuum_get_cutoffs. This has some nice
advantages: 1/ It makes the check once per table (to help with large
tables). 2/ It makes the check less costly since we rely on already
computed OldestXmin and nextXID values. 3/ It avoids having the
checkpointer do the XID-age based slot invalidation, which keeps the
usage of this GUC simple with no additional costs to the checkpointer;
just vacuum (both the VACUUM command and autovacuum) does the
invalidation when needed.
I moved the new tests to the existing TAP test file
t/019_replslot_limit.pl alongside other invalidation tests.
I added detailed comments around InvalidateXIDAgedReplicationSlots and
slightly modified the docs.
Please find the v3 patch for further review.
PS: Thanks Sawada-san for the offlist chat.
--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0001-Add-XID-age-based-replication-slot-invalidation.patch (+394, -11)
On Tue, Mar 24, 2026 at 2:42 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Hi,
On Mon, Mar 23, 2026 at 4:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've studied the discussion on this thread and the patch. I understand
the purpose of this feature and agree that it's useful especially in
cases where orphaned (physical or logical) replication slots prevent
the xmin from advancing and inactive_since based slot invalidation
might not fit.And +1 for treating both the slot's xmin and catalog_xmin similarly
with the single GUC.
Thanks for reviewing the patch.
I made the following design choice: try invalidating only once per
vacuum cycle, not per table. While this keeps the cost of checking
(incl. the XidGenLock contention) for invalidation to a minimum when
there are a large number of tables and replication slots, it can be
less effective when individual tables/indexes are large. Invalidating
during checkpoints can help to some extent with the large table/index
cases. But I'm open to thoughts on this.
It may not solve the intent when the vacuum cycle is longer, which one can expect on a large database particularly when there is heavy bloat.
This design choice boils down to the following: a database instance
having either 1/ a large number of small tables or 2/ large tables.
From my experience, I have seen both cases but mostly case 2 (others
can correct me). In this context, having an XID age based slot
invalidation check once per relation makes sense. However, I'm open to
more thoughts here.
ISTM that checking the XID-based slot invalidation per table would be
more bullet-proof and cover many cases. How about checking the
XID-based slot invalidation opportunity only when the OldestXmin is
older than the new GUC? For example, we can do this check in
heap_vacuum_rel() based on the VacuumCutoffs returned by
vacuum_get_cutoffs(). If we invalidate at least one slot for its XID,
we can re-compute the OldestXmin.
Agreed. Here's the patch that moves the XID-age based slot
invalidation check to vacuum_get_cutoffs. This has some nice
advantages: 1/ It makes the check once per table (to help with large
tables). 2/ It makes the check less costly since we rely on already
computed OldestXmin and nextXID values. 3/ It avoids the checkpointer
to do XID-age based slot invalidation which keeps the usage of this
GUC simple with no additional costs to the checkpointer - just the
vacuum (both vacuum command and autovacuum) does the invalidation when
needed.
I moved the new tests to the existing TAP test file
t/019_replslot_limit.pl alongside other invalidation tests.
I added detailed comments around InvalidateXIDAgedReplicationSlots and
slightly modified the docs.
Please find the v3 patch for further review.
Thank you for updating the patch. I think the patch is reasonably
simple and can avoid unnecessary overheads well due to XID-based
checks. Here are some comments:
+ /*
+ * Try to invalidate XID-aged replication slots that may interfere with
+ * vacuum's ability to freeze and remove dead tuples. Since OldestXmin
+ * already covers the slot xmin/catalog_xmin values, pass it as a
+ * preliminary check to avoid additional iteration over all the slots.
+ *
+ * If at least one slot was invalidated, recompute OldestXmin so that this
+ * vacuum benefits from the advanced horizon immediately.
+ */
+ if (InvalidateXIDAgedReplicationSlots(cutoffs->OldestXmin, nextXID))
+ {
+ cutoffs->OldestXmin = GetOldestNonRemovableTransactionId(rel);
+ Assert(TransactionIdIsNormal(cutoffs->OldestXmin));
+ }
vacuum_get_cutoffs() is also called by VACUUM FULL, CLUSTER, and
REPACK. I'm not sure that users would expect the slot invalidation
also in these commands. I think it's better to leave
vacuum_get_cutoffs() a pure cutoff-computation function, and we can try
to invalidate slots in heap_vacuum_rel(). It requires an additional
ReadNextTransactionId() but we can live with it, or we can make
vacuum_get_cutoffs() return the nextXID as well (stored in *cutoffs).
---
+ /* ensure it's a "normal" XID, else TransactionIdPrecedes misbehaves */
+ /* this can cause the limit to go backwards by 3, but that's OK */
+ if (!TransactionIdIsNormal(cutoffXID))
+ cutoffXID = FirstNormalTransactionId;
+
+ if (TransactionIdPrecedes(oldestXmin, cutoffXID))
+ {
+ invalidated = InvalidateObsoleteReplicationSlots(RS_INVAL_XID_AGE,
+ 0,
+ InvalidOid,
+ InvalidTransactionId,
+ nextXID);
+ }
I think it's better to check the procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin before iterating over each
slot. Otherwise, we would end up checking every slot even when a long
running transaction holds the oldestxmin back.
---
+ if (cutoffXID < FirstNormalTransactionId)
+ cutoffXID -= FirstNormalTransactionId;
and
+ if (!TransactionIdIsNormal(cutoffXID))
+ cutoffXID = FirstNormalTransactionId;
These two pieces of code have the same comment but do slightly different
things. I guess the latter is missing a '-'?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,
On Tue, Mar 24, 2026 at 11:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Please find the v3 patch for further review.
Thank you for updating the patch. I think the patch is reasonably
simple and can avoid unnecessary overheads well due to XID-based
checks. Here are some comments:
Thank you for reviewing the patch.
vacuum_get_cutoffs() is also called by VACUUM FULL, CLUSTER, and
REPACK. I'm not sure that users would expect the slot invalidation
also in these commands. I think it's better to leave
vacuum_get_cutoffs() a pure cutoff computation function and we can try
to invalidate slots in heap_vacuum_rel(). It requires additional
ReadNextTransactionId() but we can live with it, or we can make
vacuum_get_cutoffs() return the nextXID as well (stored in *cutoffs).
+1. I chose to perform the slot invalidation in heap_vacuum_rel by
getting the next transaction ID and calling vacuum_get_cutoffs again
when a slot gets invalidated. IMHO, this is simpler than adding a flag
and doing the invalidation selectively in vacuum_get_cutoffs.
+ if (TransactionIdPrecedes(oldestXmin, cutoffXID))
+ {
+     invalidated = InvalidateObsoleteReplicationSlots(RS_INVAL_XID_AGE,
+                                                      0,
+                                                      InvalidOid,
+                                                      InvalidTransactionId,
+                                                      nextXID);
+ }
I think it's better to check the procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin before iterating over each
slot. Otherwise, we would end up checking every slot even when a long
running transaction holds the oldestxmin back.
+1. Changed.
+ if (!TransactionIdIsNormal(cutoffXID))
+     cutoffXID = FirstNormalTransactionId;
These two pieces of code have the same comment but do slightly different
things. I guess the latter is missing a '-'?
Fixed the typo.
I fixed a test error being reported in CI.
Please find the attached v4 patch for further review.
I've also attached the 0002 patch that adds a test case to demo a
production-like scenario by pushing the database to the XID wraparound
limits, checking whether the XID-age based invalidation works correctly
with the GUC set to the default vacuum_failsafe_age of 1.6B, and whether
autovacuum can successfully remove this replication slot blocker to
proceed with freezing and bring the database back to normal. I don't
intend to get this committed unless others think otherwise, but I wanted
to have it as a reference.
--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0001-Add-XID-age-based-replication-slot-invalidation.patch (+402, -11)
v4-0002-Add-more-tests-for-XID-age-slot-invalidation.patch (+164, -2)
On Wed, Mar 25, 2026 at 12:17 PM Bharath Rupireddy <
bharath.rupireddyforpostgres@gmail.com> wrote:
Hi,
On Tue, Mar 24, 2026 at 11:50 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Please find the v3 patch for further review.
Thank you for updating the patch. I think the patch is reasonably
simple and can avoid unnecessary overheads well due to XID-based
checks. Here are some comments:
Thank you for reviewing the patch.
vacuum_get_cutoffs() is also called by VACUUM FULL, CLUSTER, and
REPACK. I'm not sure that users would expect the slot invalidation
also in these commands. I think it's better to leave
vacuum_get_cutoffs() a pure cutoff computation function and we can try
to invalidate slots in heap_vacuum_rel(). It requires additional
ReadNextTransactionId() but we can live with it, or we can make
vacuum_get_cutoffs() return the nextXID as well (stored in *cutoffs).
+1. I chose to perform the slot invalidation in heap_vacuum_rel by
getting the next transaction ID and calling vacuum_get_cutoffs again
when a slot gets invalidated. IMHO, this is simpler than adding a flag
and doing the invalidation selectively in vacuum_get_cutoffs.
+ if (TransactionIdPrecedes(oldestXmin, cutoffXID))
+ {
+     invalidated = InvalidateObsoleteReplicationSlots(RS_INVAL_XID_AGE,
+                                                      0,
+                                                      InvalidOid,
+                                                      InvalidTransactionId,
+                                                      nextXID);
+ }
I think it's better to check the procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin before iterating over each
slot. Otherwise, we would end up checking every slot even when a long
running transaction holds the oldestxmin back.
+1. Changed.
+ if (!TransactionIdIsNormal(cutoffXID))
+     cutoffXID = FirstNormalTransactionId;
These two pieces of code have the same comment but do slightly different
things. I guess the latter is missing a '-'?
Fixed the typo.
I fixed a test error being reported in CI.
Please find the attached v4 patch for further review.
InvalidateObsoleteReplicationSlots(uint32 possible_causes,
XLogSegNo oldestSegno, Oid dboid,
- TransactionId snapshotConflictHorizon)
+ TransactionId snapshotConflictHorizon, TransactionId nextXID)
Maybe add TransactionId nextXID on a new line?
Thinking out loud: vacuum doesn't run on a hot standby, which means this
GUC is not applicable there. Is this intended? Why not call the check
during checkpoint/restartpoint itself, like the other slot invalidation
checks?
Thanks,
Satya
Hi,
On Wed, Mar 25, 2026 at 12:17 PM Bharath Rupireddy <
bharath.rupireddyforpostgres@gmail.com> wrote:
Hi,
On Tue, Mar 24, 2026 at 11:50 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Please find the v3 patch for further review.
Thank you for updating the patch. I think the patch is reasonably
simple and can avoid unnecessary overheads well due to XID-based
checks. Here are some comments:
Thank you for reviewing the patch.
vacuum_get_cutoffs() is also called by VACUUM FULL, CLUSTER, and
REPACK. I'm not sure that users would expect the slot invalidation
also in these commands. I think it's better to leave
vacuum_get_cutoff() a pure cutoff computation function and we can try
to invalidate slots in heap_vacuum_rel(). It requires additional
ReadNextTransactionId() but we can live with it, or we can make
vacuum_get_cutoffs() return the nextXID as well (stored in *cutoffs).

+1. I chose to perform the slot invalidation in heap_vacuum_rel by
getting the next txn ID and calling vacuum_get_cutoffs again when a
slot gets invalidated. IMHO, this is simpler than adding a flag and
doing the invalidation selectively in vacuum_get_cutoffs.

+ if (TransactionIdPrecedes(oldestXmin, cutoffXID))
+ {
+     invalidated = InvalidateObsoleteReplicationSlots(RS_INVAL_XID_AGE,
+                                                      0,
+                                                      InvalidOid,
+                                                      InvalidTransactionId,
+                                                      nextXID);
+ }

I think it's better to check the procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin before iterating over each
slot. Otherwise, we would end up checking every slot even when a long
running transaction holds the oldestxmin back.

+1. Changed.

+ if (!TransactionIdIsNormal(cutoffXID))
+     cutoffXID = FirstNormalTransactionId;

These codes have the same comment but are doing a slightly different
thing. I guess the latter is missing '-'?

Fixed the typo.
I fixed a test error being reported in CI.
Please find the attached v4 patch for further review.
+ if (InvalidateXIDAgedReplicationSlots(vacrel->cutoffs.OldestXmin,
+ ReadNextTransactionId()))
Does this account for catalog xmin for data tables?
Thanks,
Satya
On Wed, Mar 25, 2026 at 12:17 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Hi,
On Tue, Mar 24, 2026 at 11:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Please find the v3 patch for further review.
Thank you for updating the patch. I think the patch is reasonably
simple and can avoid unnecessary overheads well due to XID-based
checks. Here are some comments:
Thank you for reviewing the patch.
vacuum_get_cutoff() is also called by VACUUM FULL, CLUSTER, and
REPACK. I'm not sure that users would expect the slot invalidation
also in these commands. I think it's better to leave
vacuum_get_cutoff() a pure cutoff computation function and we can try
to invalidate slots in heap_vacuum_rel(). It requires additional
ReadNextTransactionId() but we can live with it, or we can make
vacuum_get_cutoffs() return the nextXID as well (stored in *cutoffs).

+1. I chose to perform the slot invalidation in heap_vacuum_rel by
getting the next txn ID and calling vacuum_get_cutoffs again when a
slot gets invalidated. IMHO, this is simpler than adding a flag and
doing the invalidation selectively in vacuum_get_cutoffs.

+ if (TransactionIdPrecedes(oldestXmin, cutoffXID))
+ {
+     invalidated = InvalidateObsoleteReplicationSlots(RS_INVAL_XID_AGE,
+                                                      0,
+                                                      InvalidOid,
+                                                      InvalidTransactionId,
+                                                      nextXID);
+ }

I think it's better to check the procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin before iterating over each
slot. Otherwise, we would end up checking every slot even when a long
running transaction holds the oldestxmin back.

+1. Changed.

+ if (!TransactionIdIsNormal(cutoffXID))
+     cutoffXID = FirstNormalTransactionId;

These codes have the same comment but are doing a slightly different
thing. I guess the latter is missing '-'?

Fixed the typo.
I fixed a test error being reported in CI.
Please find the attached v4 patch for further review.
Thank you for updating the patch. I've reviewed the patch and have
some review comments:
+ /* translator: %s is a GUC variable name */
+ appendStringInfo(&err_detail, _("The slot's xmin
%u at next transaction ID %u exceeds the age %d specified by
\"%s\"."),
+ xmin,
+ nextXID,
+ max_slot_xid_age,
+ "max_slot_xid_age");
I think it's better to show the age of the slot's xmin instead of the
recent XID.
---
+
+ if (!TransactionIdIsNormal(oldestXmin) || !TransactionIdIsNormal(nextXID))
+ return false;
+
Do we expect that the passed oldestXmin or nextXID could be non-normal
XIDs? I think the function assumes these are valid XIDs.
Also, since this function is called only by heap_vacuum_rel(), we can
call ReadNextTransactionId() within this function.
---
+ if (IsReplicationSlotXIDAged(slot_xmin, slot_catalog_xmin, nextXID))
We compute the cutoff XID in IsReplicationSlotXIDAged() again, which
seems redundant.
I've attached the fixup patch addressing these comments and having
some code cleanups. Please review it.
I'm reviewing the regression test part, and will share review comments soon.
I've also attached the 0002 patch that adds a test case to demo a
production-like scenario by pushing the database to XID wraparound
limits and checking if the XID-age based invalidation with the GUC
setting at the default vacuum_failsafe_age of 1.6B works correctly,
and whether autovacuum can successfully remove this replication slot
blocker to proceed with freezing and bring the database back to
normal. I don't intend to get this committed unless others think
otherwise, but I wanted to have this as a reference.
Thank you for sharing the test script! I'll check it as well.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v4_cleanup_masahiko.patch (text/x-patch, +89-116)
Hi,
On Thu, Mar 26, 2026 at 2:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for updating the patch. I've reviewed the patch and have
some review comments:
Thank you for reviewing the patch.
+ /* translator: %s is a GUC variable name */
+ appendStringInfo(&err_detail, _("The slot's xmin %u at next transaction ID %u exceeds the age %d specified by \"%s\"."),
+                  xmin,
+                  nextXID,
+                  max_slot_xid_age,
+                  "max_slot_xid_age");

I think it's better to show the age of the slot's xmin instead of the
recent XID.
Agreed.
---

+ if (!TransactionIdIsNormal(oldestXmin) || !TransactionIdIsNormal(nextXID))
+     return false;

Do we expect that the passed oldestXmin or nextXID could be non-normal
XIDs? I think the function assumes these are valid XIDs.
The oldestXmin is now removed. Please see the responses at the end.
Also, since this function is called only by heap_vacuum_rel(), we can
call ReadNextTransactionId() within this function.
Agreed.
---

+ if (IsReplicationSlotXIDAged(slot_xmin, slot_catalog_xmin, nextXID))

We compute the cutoff XID in IsReplicationSlotXIDAged() again, which
seems redundant.

I've attached the fixup patch addressing these comments and having
some code cleanups. Please review it.
The fixup patch looked good to me; I merged it into the attached v5 patch.
I'm reviewing the regression test part, and will share review comments soon.
I've also attached the 0002 patch that adds a test case to demo a
production-like scenario by pushing the database to XID wraparound
limits and checking if the XID-age based invalidation with the GUC
setting at the default vacuum_failsafe_age of 1.6B works correctly,
and whether autovacuum can successfully remove this replication slot
blocker to proceed with freezing and bring the database back to
normal. I don't intend to get this committed unless others think
otherwise, but I wanted to have this as a reference.
Thank you for sharing the test script! I'll check it as well.
Thank you.
On Thu, Mar 26, 2026 at 3:42 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
Hi,
+ if (InvalidateXIDAgedReplicationSlots(vacrel->cutoffs.OldestXmin,
+                                       ReadNextTransactionId()))

Does this account for catalog xmin for data tables?
Nice catch! When vacuum runs on regular tables, it doesn't cover
catalog_xmin in the OldestXmin. So if catalog_xmin is blocking
relfrozenxid advancement, slot invalidation doesn't happen. I updated
vacuum_get_cutoffs to return slot_catalog_xmin and slot_xmin. These
values are already available in ComputeXidHorizons, so this doesn't
require an additional proc-array lock.
I also added support for XID age based slot invalidation during
checkpoints. This helps standbys that can have replication slots but
where vacuum doesn't run. (It skips synced slots, just like
idle_replication_slot_timeout does.)
Please find the attached v5 patches for further review. Thank you!
--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com
Attachments:
v5-0001-Add-XID-age-based-replication-slot-invalidation.patch (application/x-patch, +435-16)
v5-0002-Add-more-tests-for-XID-age-slot-invalidation.patch (application/x-patch, +164-2)
Hello,
Thanks for the v5 patch set, I have reviewed and did initial testing on
v5 patch set, and it LGTM, except these
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 286f0f46341..c2ff7e464f0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1849,7 +1849,7 @@ ReportSlotInvalidation(ReplicationSlotInvalidationCause cause,
             else
             {
                 /* translator: %s is a GUC variable name */
-                appendStringInfo(&err_detail, _("The slot's xmin %u is %d transactions old, which exceeds the configured \"%s\" value of %d."),
+                appendStringInfo(&err_detail, _("The slot's catalog_xmin %u is %d transactions old, which exceeds the configured \"%s\" value of %d."),
                                  catalog_xmin, (int32) (recentXid - catalog_xmin), "max_slot_xid_age", max_slot_xid_age);
             }
While testing the active slot XID age invalidation (SIGTERM path), I
observed that the slot got invalidated and the walsender was killed by
SIGTERM. Then the infinite retry cycle begins: the walreceiver restarts
the walsender, the walsender tries to use the invalidated slot and
dies, and so on. I will think more on this.
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
Hi,
On Sun, Mar 29, 2026 at 1:16 PM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
Hello,
Thanks for the v5 patch set, I have reviewed and did initial testing on
v5 patch set, and it LGTM, except these
Thank you for reviewing and testing. I appreciate it.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 286f0f46341..c2ff7e464f0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1849,7 +1849,7 @@ ReportSlotInvalidation(ReplicationSlotInvalidationCause cause,
             else
             {
                 /* translator: %s is a GUC variable name */
-                appendStringInfo(&err_detail, _("The slot's xmin %u is %d transactions old, which exceeds the configured \"%s\" value of %d."),
+                appendStringInfo(&err_detail, _("The slot's catalog_xmin %u is %d transactions old, which exceeds the configured \"%s\" value of %d."),
                                  catalog_xmin, (int32) (recentXid - catalog_xmin), "max_slot_xid_age", max_slot_xid_age);
             }
Fixed the typo.
While testing the active slot XID age invalidation (SIGTERM path), I
observed that the slot got invalidated and the walsender was killed by
SIGTERM. Then the infinite retry cycle begins: the walreceiver restarts
the walsender, the walsender tries to use the invalidated slot and
dies, and so on. I will think more on this.
I would like to clarify that once a slot is invalidated due to any of
the reasons (ReplicationSlotInvalidationCause), it becomes unusable;
the sender will error out if the receiver tries to use it. This is
consistent with all existing slot invalidation mechanisms.
Please find the attached v6 patches fixing the typo for further review.
--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com
Attachments:
v6-0002-Add-more-tests-for-XID-age-slot-invalidation.patch (application/octet-stream, +164-2)
v6-0001-Add-XID-age-based-replication-slot-invalidation.patch (application/octet-stream, +435-16)
On Sun, Mar 29, 2026 at 6:35 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Hi,
On Sun, Mar 29, 2026 at 1:16 PM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
Hello,
Thanks for the v5 patch set, I have reviewed and did initial testing on
v5 patch set, and it LGTM, except these
Thank you for reviewing and testing. I appreciate it.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 286f0f46341..c2ff7e464f0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1849,7 +1849,7 @@ ReportSlotInvalidation(ReplicationSlotInvalidationCause cause,
             else
             {
                 /* translator: %s is a GUC variable name */
-                appendStringInfo(&err_detail, _("The slot's xmin %u is %d transactions old, which exceeds the configured \"%s\" value of %d."),
+                appendStringInfo(&err_detail, _("The slot's catalog_xmin %u is %d transactions old, which exceeds the configured \"%s\" value of %d."),
                                  catalog_xmin, (int32) (recentXid - catalog_xmin), "max_slot_xid_age", max_slot_xid_age);
             }

Fixed the typo.
While testing the active slot XID age invalidation (SIGTERM path), I
observed that the slot got invalidated and the walsender was killed by
SIGTERM. Then the infinite retry cycle begins: the walreceiver restarts
the walsender, the walsender tries to use the invalidated slot and
dies, and so on. I will think more on this.

I would like to clarify that once a slot is invalidated due to any of
the reasons (ReplicationSlotInvalidationCause), it becomes unusable;
the sender will error out if the receiver tries to use it. This is
consistent with all existing slot invalidation mechanisms.
Please find the attached v6 patches fixing the typo for further review.
I've reviewed the v6 patch. Here are some comments.
bool
vacuum_get_cutoffs(Relation rel, const VacuumParams params,
- struct VacuumCutoffs *cutoffs)
+ struct VacuumCutoffs *cutoffs,
+ TransactionId *slot_xmin,
+ TransactionId *slot_catalog_xmin)
How about storing both slot_xmin and catalog_xmin into VacuumCutoffs?
---
-    if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED | RS_INVAL_IDLE_TIMEOUT,
+    possibleInvalidationCauses = RS_INVAL_WAL_REMOVED | RS_INVAL_IDLE_TIMEOUT |
+        RS_INVAL_XID_AGE;
+
+    if (InvalidateObsoleteReplicationSlots(possibleInvalidationCauses,
                                            _logSegNo, InvalidOid,
+                                           InvalidTransactionId,
+                                           max_slot_xid_age > 0 ?
+                                           ReadNextTransactionId() :
+                                           InvalidTransactionId))
It's odd to me that we specify RS_INVAL_XID_AGE while passing
InvalidTransactionId. I think we can specify RS_INVAL_XID_AGE along
with a valid recentXId only when we'd like to check the slots based on
their XIDs.
---
+ /* Check if the slot needs to be invalidated due to max_slot_xid_age GUC */
+ if ((possible_causes & RS_INVAL_XID_AGE) && CanInvalidateXidAgedSlot(s))
+ {
+ TransactionId xidLimit;
+
+ Assert(TransactionIdIsValid(recentXid));
+
+ xidLimit = TransactionIdRetreatedBy(recentXid, max_slot_xid_age);
+
I think we can avoid calculating xidLimit for every slot by
calculating it in InvalidatePossiblyObsoleteSlot() and passing it to
DetermineSlotInvalidationCause().
---
*/
TransactionId
GetOldestNonRemovableTransactionId(Relation rel)
+{
+ return GetOldestNonRemovableTransactionIdExt(rel, NULL, NULL);
+}
+
+/*
+ * Same as GetOldestNonRemovableTransactionId(), but also returns the
+ * replication slot xmin and catalog_xmin from the same ComputeXidHorizons()
+ * call. This avoids a separate ProcArrayLock acquisition when the caller
+ * needs both values.
+ */
+TransactionId
+GetOldestNonRemovableTransactionIdExt(Relation rel,
+ TransactionId *slot_xmin,
+ TransactionId *slot_catalog_xmin)
{
I understand that the primary reason why the patch introduces another
variant of GetOldestNonRemovableTransactionId() is to avoid extra
ProcArrayLock acquisition to get replication slot xmin and catalog_xmin.
While it's not very elegant, I find that it would not be bad because
otherwise autovacuum takes extra ProcArrayLock (in shared mode) for
every table to vacuum. The ProcArrayLock is already a known highly
contended lock, so it would be better to avoid taking it once more.
If others think differently, we can just call
ProcArrayGetReplicationSlotXmin() separately and compare them to the
limit of XID-age based slot invalidation.
Having said that, I personally don't want to add new instructions to
the existing GetOldestNonRemovableTransactionId(). I guess we might
want to make both the existing function and new function call a common
(inline) function that takes ComputeXidHorizonsResult and returns
the appropriate transaction ID based on the given relation.
---
+ # Do some work to advance xids
+ $node->safe_psql(
+ 'postgres', qq[
+ do \$\$
+ begin
+ for i in 1..$nxids loop
+ -- use an exception block so that each iteration eats an XID
+ begin
+ insert into $table_name values (i);
+ exception
+ when division_by_zero then null;
+ end;
+ end loop;
+ end\$\$;
+ ]);
I think it's faster to use pg_current_xact_id() instead.
---
+ else
+ {
+ $node->safe_psql('postgres', "VACUUM");
+ }
We don't need to vacuum all tables here.
---
+# Configure primary with XID age settings. Set autovacuum_naptime high so
+# that the checkpointer (not vacuum) triggers the invalidation.
+my $max_slot_xid_age = 500;
+$primary5->append_conf(
+ 'postgresql.conf', qq{
+max_slot_xid_age = $max_slot_xid_age
+autovacuum_naptime = '1h'
+});
I think that it's better to disable autovacuum than setting a large number.
---
+# Testcase end: Invalidate streaming standby's slot due to max_slot_xid_age
+# GUC (via checkpoint).
I think that we can say "physical slot" instead of standby's slot to
avoid confusion as I thought standby's slot is a slot created on the
standby at the first glance.
---
Do we have tests for invalidating slots on the standbys?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Bharath,
Thanks for re-working the project.
While looking at the old discussion, I found that Robert Haas was against the XID-based
invalidation, because it's difficult to determine the cutoff age [1].
Can you clarify your thoughts on this point? Are you focusing on solving the
wraparound issue, rather than the bloated-instance issue?
The code may not be accepted unless we get his agreement.
[1]: /messages/by-id/CA+TgmoZTbaaEjSZUG1FL0mzxAdN3qmXksO3O9_PZhEuXTkVnRQ@mail.gmail.com
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi,
On Tue, Mar 31, 2026 at 12:25 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Bharath,
Thanks for re-working the project.
Thank you for looking into this.
While looking at the old discussion, I found that Robert Haas was against the XID-based
invalidation, because it's difficult to determine the cutoff age [1].
Can you clarify your thoughts on this point? Are you focusing on solving the
wraparound issue, rather than the bloated-instance issue?
The code may not be accepted unless we get his agreement.
[1]: /messages/by-id/CA+TgmoZTbaaEjSZUG1FL0mzxAdN3qmXksO3O9_PZhEuXTkVnRQ@mail.gmail.com
I summarized what others (Nathan, Robert, Amit, Alvaro, Bertrand) said
about it here with my responses:
/messages/by-id/CALj2ACVY+Fd5vC0VjW=5VDK9mmt-Y+PDZxnBp8ngGAZc24Vv9g@mail.gmail.com.
Please have a look.
A good setting for this in production scenarios is to set
max_slot_xid_age to vacuum_failsafe_age (1.6B) or a little less, so that
autovacuum invalidates the slot before entering failsafe mode,
unblocking datfrozenxid advancement and avoiding XID wraparound
without manual VACUUM or downtime. I added a test for this in the 0002
patch. Please have a look.
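To make that guidance concrete, a hedged sketch of the corresponding configuration (max_slot_xid_age is the GUC proposed by this patch; the values mirror the vacuum_failsafe_age default, and the exact margin is a judgment call):

```
# invalidate aged slots slightly before VACUUM's failsafe would kick in,
# so autovacuum can advance datfrozenxid without manual intervention
max_slot_xid_age = 1500000000       # a little below vacuum_failsafe_age
#vacuum_failsafe_age = 1600000000   # default
```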
--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com