BF mamba failure

Started by Amit Kapila · almost 3 years ago · 30 messages
#1Amit Kapila
amit.kapila16@gmail.com

Hi,

Peter Smith has recently reported a BF failure [1]. AFAICS, the call
stack of failure [2] is as follows:

0x1e66644 <ExceptionalCondition+0x8c> at postgres
0x1d0143c <pgstat_release_entry_ref+0x4c0> at postgres
0x1d02534 <pgstat_get_entry_ref+0x780> at postgres
0x1cfb120 <pgstat_prep_pending_entry+0x8c> at postgres
0x1cfd590 <pgstat_report_disconnect+0x34> at postgres
0x1cfbfc0 <pgstat_shutdown_hook+0xd4> at postgres
0x1ca7b08 <shmem_exit+0x7c> at postgres
0x1ca7c74 <proc_exit_prepare+0x70> at postgres
0x1ca7d2c <proc_exit+0x18> at postgres
0x1cdf060 <PostgresMain+0x584> at postgres
0x1c203f4 <ServerLoop.isra.0+0x1e88> at postgres
0x1c2161c <PostmasterMain+0xfa4> at postgres
0x1edcf94 <main+0x254> at postgres

I couldn't correlate it to the recent commits. Any thoughts?

[1]: /messages/by-id/CAHut+PsHdWFjU43VEX+R-8de6dFQ-_JWrsqs=vWek1hULexP4Q@mail.gmail.com
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2023-03-17%2005%3A36%3A10

--
With Regards,
Amit Kapila.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#1)
Re: BF mamba failure

Amit Kapila <amit.kapila16@gmail.com> writes:

Peter Smith has recently reported a BF failure [1]. AFAICS, the call
stack of failure [2] is as follows:

Note the assertion report a few lines further up:

TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 560, PID: 25004

regards, tom lane

#3Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#2)
1 attachment(s)
Re: BF mamba failure

Hi,

18.03.2023 07:26, Tom Lane wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Peter Smith has recently reported a BF failure [1]. AFAICS, the call
stack of failure [2] is as follows:

Note the assertion report a few lines further up:

TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 560, PID: 25004

This assertion failure can be reproduced easily with the attached patch:
============== running regression test queries        ==============
test oldest_xmin                  ... ok           55 ms
test oldest_xmin                  ... FAILED (test process exited with exit code 1)      107 ms
test oldest_xmin                  ... FAILED (test process exited with exit code 1)        8 ms
============== shutting down postmaster               ==============

contrib/test_decoding/output_iso/log/postmaster.log contains:
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 561,
PID: 456844

With the sleep placed above Assert(entry_ref->shared_entry->dropped) this Assert fails too.

Best regards,
Alexander

Attachments:

pgstat_shmem-debugging.patch (text/x-patch; charset=UTF-8)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c7ce603706..9b3389a90e 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot
+ISOLATION = oldest_xmin oldest_xmin oldest_xmin
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 09fffd0e82..5307116670 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -551,6 +551,7 @@ pgstat_release_entry_ref(PgStat_HashKey key, PgStat_EntryRef *entry_ref,
 			/* only dropped entries can reach a 0 refcount */
 			Assert(entry_ref->shared_entry->dropped);
 
+pg_usleep(100000L);
 			shent = dshash_find(pgStatLocal.shared_hash,
 								&entry_ref->shared_entry->key,
 								true);
#4Peter Smith
smithpb2250@gmail.com
In reply to: Alexander Lakhin (#3)
Re: BF mamba failure

On Sun, Mar 19, 2023 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

Hi,

18.03.2023 07:26, Tom Lane wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Peter Smith has recently reported a BF failure [1]. AFAICS, the call
stack of failure [2] is as follows:

Note the assertion report a few lines further up:

TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 560, PID: 25004

This assertion failure can be reproduced easily with the attached patch:
============== running regression test queries ==============
test oldest_xmin ... ok 55 ms
test oldest_xmin ... FAILED (test process exited with exit code 1) 107 ms
test oldest_xmin ... FAILED (test process exited with exit code 1) 8 ms
============== shutting down postmaster ==============

contrib/test_decoding/output_iso/log/postmaster.log contains:
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 561, PID: 456844

With the sleep placed above Assert(entry_ref->shared_entry->dropped) this Assert fails too.

Best regards,
Alexander

I used a slightly modified* patch of Alexander's [1] applied to the
latest HEAD code (but with my "toptxn" patch reverted).
--- the patch was modified in that I injected 'sleep' both above and
below the Assert(entry_ref->shared_entry->dropped).

Using this I was also able to reproduce the problem, but test failures
were rare: make check-world seemed OK, and indeed the
test_decoding tests would also appear to PASS around 14 out of 15
times.

============== running regression test queries ==============
test oldest_xmin ... ok 342 ms
test oldest_xmin ... ok 121 ms
test oldest_xmin ... ok 283 ms
============== shutting down postmaster ==============
============== removing temporary instance ==============

=====================
All 3 tests passed.
=====================

~~

Often (but not always), despite test_decoding reporting all 3 tests
as "ok", I still observed a TRAP in the logfile
(contrib/test_decoding/output_iso/log/postmaster.log):
TRAP: failed Assert("entry_ref->shared_entry->dropped")

~~

Occasionally (about 1 in 15 test runs) the test would fail the same
way as described by Alexander [1], with the accompanying TRAP.
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount)
== 0"), File: "pgstat_shmem.c", Line: 562, PID: 32013

============== running regression test queries ==============
test oldest_xmin ... ok 331 ms
test oldest_xmin ... ok 91 ms
test oldest_xmin ... FAILED 702 ms
============== shutting down postmaster ==============

======================
1 of 3 tests failed.
======================

~~

FWIW, the "toptxn" patch. whose push coincided with the build-farm
error I first reported [2]/messages/by-id/CAHut+PsHdWFjU43VEX+R-8de6dFQ-_JWrsqs=vWek1hULexP4Q@mail.gmail.com, turns out to be an innocent party in this
TRAP. We know this because all of the above results were running using
HEAD code but with that "toptxn" patch reverted.

------
[1]: /messages/by-id/1941b7e2-be7c-9c4c-8505-c0fd05910e9a@gmail.com
[2]: /messages/by-id/CAHut+PsHdWFjU43VEX+R-8de6dFQ-_JWrsqs=vWek1hULexP4Q@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

#5Alexander Lakhin
exclusion@gmail.com
In reply to: Peter Smith (#4)
1 attachment(s)
Re: BF mamba failure

Hello hackers,

20.03.2023 09:10, Peter Smith wrote:

Using this I was also able to reproduce the problem. But test failures
were rare. The make check-world seemed OK, and indeed the
test_decoding tests would also appear to PASS around 14 out of 15
times.

I've stumbled upon this assertion failure again during testing following cd312adc5.

This time I've simplified the reproducer to the attached modification.
With this patch applied, `make -s check -C contrib/test_decoding` fails on master as below:
ok 1         - pgstat_rc_1                                14 ms
not ok 2     - pgstat_rc_2                              1351 ms

contrib/test_decoding/output_iso/log/postmaster.log contains:
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 562,
PID: 1130928

With extra logging added, I see the following events happening:
1) pgstat_rc_1.setup calls pgstat_create_replslot(), gets
  ReplicationSlotIndex(slot) = 0 and calls
  pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0).

2) pgstat_rc_1.s0_get_changes executes pg_logical_slot_get_changes(...)
  and then calls pgstat_gc_entry_refs on shmem_exit() ->
  pgstat_shutdown_hook() ...;
  with the sleep added inside pgstat_release_entry_ref, this backend waits
  after decreasing entry_ref->shared_entry->refcount to 0.

3) pgstat_rc_1.stop removes the replication slot.

4) pgstat_rc_2.setup calls pgstat_create_replslot(), gets
  ReplicationSlotIndex(slot) = 0 and calls
  pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0),
  which leads to the call pgstat_reinit_entry(), which increases refcount
  for the same shared_entry as in (1) and (2), and then to the call
  pgstat_acquire_entry_ref(), which increases refcount once more.

5) the backend from (2) reaches
Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0),
  which fails due to refcount = 2.

Best regards,
Alexander

Attachments:

pgstat_rc-debugging.patch (text/x-patch; charset=UTF-8)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c7ce603706..660eea87eb 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -6,10 +6,12 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate stream stats twophase twophase_stream
-ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
+XISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot
 
+ISOLATION = pgstat_rc_1 pgstat_rc_2
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/pgstat_rc_1.out b/contrib/test_decoding/expected/pgstat_rc_1.out
new file mode 100644
index 0000000000..850adf9930
--- /dev/null
+++ b/contrib/test_decoding/expected/pgstat_rc_1.out
@@ -0,0 +1,18 @@
+Parsed test spec with 1 sessions
+
+starting permutation: s0_get_changes
+?column?
+--------
+init    
+(1 row)
+
+step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data
+----
+(0 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/expected/pgstat_rc_2.out b/contrib/test_decoding/expected/pgstat_rc_2.out
new file mode 100644
index 0000000000..f8d4ce83a6
--- /dev/null
+++ b/contrib/test_decoding/expected/pgstat_rc_2.out
@@ -0,0 +1,19 @@
+Parsed test spec with 1 sessions
+
+starting permutation: s0_sleep
+?column?
+--------
+init    
+(1 row)
+
+step s0_sleep: SELECT pg_sleep(1);
+pg_sleep
+--------
+        
+(1 row)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/pgstat_rc_1.spec b/contrib/test_decoding/specs/pgstat_rc_1.spec
new file mode 100644
index 0000000000..7b08704a55
--- /dev/null
+++ b/contrib/test_decoding/specs/pgstat_rc_1.spec
@@ -0,0 +1,14 @@
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write in xact
+}
+
+teardown
+{
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+step "s0_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+permutation "s0_get_changes"
diff --git a/contrib/test_decoding/specs/pgstat_rc_2.spec b/contrib/test_decoding/specs/pgstat_rc_2.spec
new file mode 100644
index 0000000000..4e454ae8e8
--- /dev/null
+++ b/contrib/test_decoding/specs/pgstat_rc_2.spec
@@ -0,0 +1,14 @@
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write in xact
+}
+
+teardown
+{
+	SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+step "s0_sleep" { SELECT pg_sleep(1); }
+
+permutation "s0_sleep"
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index b1c9437edc..fe517d85af 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -551,6 +551,7 @@ pgstat_release_entry_ref(PgStat_HashKey key, PgStat_EntryRef *entry_ref,
 			/* only dropped entries can reach a 0 refcount */
 			Assert(entry_ref->shared_entry->dropped);
 
+pg_usleep(100000L);
 			shent = dshash_find(pgStatLocal.shared_hash,
 								&entry_ref->shared_entry->key,
 								true);
#6Michael Paquier
michael@paquier.xyz
In reply to: Alexander Lakhin (#5)
Re: BF mamba failure

On Wed, Jun 12, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:

With extra logging added, I see the following events happening:
1) pgstat_rc_1.setup calls pgstat_create_replslot(), gets
  ReplicationSlotIndex(slot) = 0 and calls
  pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0).

2) pgstat_rc_1.s0_get_changes executes pg_logical_slot_get_changes(...)
  and then calls pgstat_gc_entry_refs on shmem_exit() ->
  pgstat_shutdown_hook() ...;
  with the sleep added inside pgstat_release_entry_ref, this backend waits
  after decreasing entry_ref->shared_entry->refcount to 0.

3) pgstat_rc_1.stop removes the replication slot.

4) pgstat_rc_2.setup calls pgstat_create_replslot(), gets
  ReplicationSlotIndex(slot) = 0 and calls
  pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0),
  which leads to the call pgstat_reinit_entry(), which increases refcount
  for the same shared_entry as in (1) and (2), and then to the call
  pgstat_acquire_entry_ref(), which increases refcount once more.

5) the backend 2 reaches
Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0),
  which fails due to refcount = 2.

Thanks for the details.

So this comes down to the point that we offer no guarantee that the
stats entry a backend sees at shutdown is the same as the one it wants
to clean up. That's the same problem as what Floris has reported
here, with an OID wraparound causing tables to get the same hash key.
That can happen for all stats kinds:
/messages/by-id/b14ae28029f64757bb64613be2549a74@Optiver.com

I don't think that this is going to fly far except if we introduce a
concept of "generation" or "age" in the stats entries. The idea is
simple: when a stats entry is reinitialized because of a drop&create,
increment a counter to tell that this is a new generation, and keep
track of it in *both* PgStat_EntryRef (local backend reference to the
shmem stats entry) *and* PgStatShared_HashEntry (the central one).
When releasing an entry, if we know that the shared entry we are
pointing at is not of the same generation as the local reference, it
means that the entry has been reused for something else with the same
hash key, so give up. It should not be that invasive, still it means
ABI breakage in the two pgstats internal structures I am mentioning,
which is OK for a backpatch as this stuff is internal. On top of
that, this problem means that we can silently and randomly lose stats,
which is not cool.
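
To make that more concrete, here is a stand-alone toy model of the idea
(illustration only: stand-in types and names, not the real pgstats
structures nor code from any posted patch), replaying the interleaving
from the pgstat_rc_1/pgstat_rc_2 reproducer quoted above:

#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for PgStatShared_HashEntry and PgStat_EntryRef. */
typedef struct SharedEntry
{
    atomic_uint refcount;
    atomic_uint generation;     /* bumped each time the entry is reused */
} SharedEntry;

typedef struct LocalRef
{
    SharedEntry *shared;
    unsigned    generation;     /* copy taken when the reference is acquired */
} LocalRef;

static void
acquire(LocalRef *ref, SharedEntry *ent)
{
    atomic_fetch_add(&ent->refcount, 1);
    ref->shared = ent;
    ref->generation = atomic_load(&ent->generation);
}

static void
reinit(SharedEntry *ent)
{
    /* drop&create reusing the same hash key, as in pgstat_reinit_entry() */
    atomic_fetch_add(&ent->refcount, 1);
    atomic_fetch_add(&ent->generation, 1);
}

static bool
release_begin(LocalRef *ref)
{
    /* true when this was the last reference, so the caller should free */
    return atomic_fetch_sub(&ref->shared->refcount, 1) == 1;
}

static bool
release_finish(LocalRef *ref)
{
    /*
     * The real code re-locks the entry with dshash_find() here; this is
     * also where the Assert(refcount == 0) that fails today is located.
     * With the generation cross-check, a reused entry is simply left alone.
     */
    if (atomic_load(&ref->shared->generation) != ref->generation)
        return false;           /* reused under us: give up, do not free */
    assert(atomic_load(&ref->shared->refcount) == 0);
    return true;                /* still ours: safe to free */
}

int
main(void)
{
    SharedEntry ent = {1, 0};   /* created, one reference held by backend A */
    LocalRef    a = {&ent, 0};
    LocalRef    b;
    bool        last;

    last = release_begin(&a);   /* (2) A's shutdown: refcount 1 -> 0, then "sleeps" */
    reinit(&ent);               /* (4) B recreates a slot with the same key */
    acquire(&b, &ent);          /* (4) ... and takes a reference: refcount is now 2 */
    /* (5) A wakes up; the generation mismatch stops it from freeing the entry */
    printf("A frees the entry: %s\n",
           last && release_finish(&a) ? "yes" : "no");
    return 0;
}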

Note that Noah has been working on a better integration of injection
points with isolation tests. We could reuse that here to have a test
case, with an injection point waiting around the pg_usleep() you have
hardcoded:
/messages/by-id/20240614003549.c2.nmisch@google.com

I'll try to give it a go on Monday.
--
Michael

#7Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#6)
1 attachment(s)
Re: BF mamba failure

On Fri, Jun 14, 2024 at 02:31:37PM +0900, Michael Paquier wrote:

I don't think that this is going to fly far except if we introduce a
concept of "generation" or "age" in the stats entries. The idea is
simple: when a stats entry is reinitialized because of a drop&create,
increment a counter to tell that this is a new generation, and keep
track of it in *both* PgStat_EntryRef (local backend reference to the
shmem stats entry) *and* PgStatShared_HashEntry (the central one).
When releasing an entry, if we know that the shared entry we are
pointing at is not of the same generation as the local reference, it
means that the entry has been reused for something else with the same
hash key, so give up. It should not be that invasive, still it means
ABI breakage in the two pgstats internal structures I am mentioning,
which is OK for a backpatch as this stuff is internal. On top of
that, this problem means that we can silently and randomly lose stats,
which is not cool.

I'll try to give it a go on Monday.

Here you go, the patch introduces what I've named an "age" counter
attached to the shared entry references, and copied over to the local
references. The counter is initialized at 0 and incremented each
time an entry is reused, then when attempting to drop an entry we
cross-check the version held locally with the shared one.

While looking at the whole, I noticed today that this is close to a
concept patch sent previously, where a counter is used in the shared
entry with a cross-check done against the local reference, posted
here:
/messages/by-id/20230603.063418.767871221863527769.horikyota.ntt@gmail.com

The logic is different though, as we don't need to care about the
contents of the local cache when cross-checking the "age" count while
retrieving the contents, just about the case where a backend attempts
to drop an entry it thinks is OK to operate on but that got reused
because of other backends doing creates and drops with the same hash
key.

This idea needs more eyes, so I am adding that to the next CF for now.
I've played with it for a few hours, with concurrent replication slot
drops/creates, without breaking it. I have not implemented an
isolation test for this case, as it depends on where we are going with
the integration of isolation tests with injection points.
--
Michael

Attachments:

0001-Introduce-agecount-for-shared-pgstats.patch (text/x-diff; charset=us-ascii)
From 3d9a8b6913c0ce61d800d3b31f392230580e99fc Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Mon, 17 Jun 2024 13:22:33 +0900
Subject: [PATCH] Introduce "agecount" for shared pgstats

This fixes a set of race conditions with cumulative statistics where a
shared stats entry could be dropped while it should still be valid
when it gets reused.  This can happen for the following case, causing
the same hash key to be used:
- replication slots that compute internally an index number.
- other object types, with a wraparound involved.

The age counter is tracked in the shared entries, initialized at 0 when
an entry is created and incremented when the same entry is reused, to
avoid concurrent turbulences on drop because of other backends still
holding a reference to it.
---
 src/include/utils/pgstat_internal.h       | 14 ++++++++
 src/backend/utils/activity/pgstat_shmem.c | 40 ++++++++++++++++++++---
 2 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index dbbca31602..2ea29b7638 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -93,6 +93,14 @@ typedef struct PgStatShared_HashEntry
 	 */
 	pg_atomic_uint32 refcount;
 
+	/*
+	 * Counter tracking the number of times the entry has been reused.
+	 *
+	 * Set to 0 when the entry is created, and incremented by one each time
+	 * the shared entry is reinitialized with pgstat_reinit_entry().
+	 */
+	pg_atomic_uint32 agecount;
+
 	/*
 	 * Pointer to shared stats. The stats entry always starts with
 	 * PgStatShared_Common, embedded in a larger struct containing the
@@ -132,6 +140,12 @@ typedef struct PgStat_EntryRef
 	 */
 	PgStatShared_Common *shared_stats;
 
+	/*
+	 * Copy of PgStatShared_HashEntry->agecount, keeping locally track of the
+	 * shared stats entry "age" retrieved (number of times reused).
+	 */
+	uint32		agecount;
+
 	/*
 	 * Pending statistics data that will need to be flushed to shared memory
 	 * stats eventually. Each stats kind utilizing pending data defines what
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index fb79c5e771..a9de1985a4 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -270,6 +270,11 @@ pgstat_init_entry(PgStat_Kind kind,
 	 * further if a longer lived reference is needed.
 	 */
 	pg_atomic_init_u32(&shhashent->refcount, 1);
+
+	/*
+	 * Initialize agecount to 0, as freshly initialized.
+	 */
+	pg_atomic_init_u32(&shhashent->agecount, 0);
 	shhashent->dropped = false;
 
 	chunk = dsa_allocate0(pgStatLocal.dsa, pgstat_get_kind_info(kind)->shared_size);
@@ -293,6 +298,12 @@ pgstat_reinit_entry(PgStat_Kind kind, PgStatShared_HashEntry *shhashent)
 
 	/* mark as not dropped anymore */
 	pg_atomic_fetch_add_u32(&shhashent->refcount, 1);
+
+	/*
+	 * Increment agecount, to let any backend with local references know that
+	 * what they point to is outdated.
+	 */
+	pg_atomic_fetch_add_u32(&shhashent->agecount, 1);
 	shhashent->dropped = false;
 
 	/* reinitialize content */
@@ -333,6 +344,7 @@ pgstat_acquire_entry_ref(PgStat_EntryRef *entry_ref,
 
 	entry_ref->shared_stats = shheader;
 	entry_ref->shared_entry = shhashent;
+	entry_ref->agecount = pg_atomic_read_u32(&shhashent->agecount);
 }
 
 /*
@@ -491,7 +503,8 @@ pgstat_get_entry_ref(PgStat_Kind kind, Oid dboid, Oid objoid, bool create,
 			 * case are replication slot stats, where a new slot can be
 			 * created with the same index just after dropping. But oid
 			 * wraparound can lead to other cases as well. We just reset the
-			 * stats to their plain state.
+			 * stats to their plain state, while incrementing its "age" in the
+			 * shared entry for any remaining local references.
 			 */
 			shheader = pgstat_reinit_entry(kind, shhashent);
 			pgstat_acquire_entry_ref(entry_ref, shhashent, shheader);
@@ -558,10 +571,27 @@ pgstat_release_entry_ref(PgStat_HashKey key, PgStat_EntryRef *entry_ref,
 			if (!shent)
 				elog(ERROR, "could not find just referenced shared stats entry");
 
-			Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0);
-			Assert(entry_ref->shared_entry == shent);
-
-			pgstat_free_entry(shent, NULL);
+			/*
+			 * This entry may have been reinitialized while trying to release
+			 * it, so double-check that it has not been reused while holding a
+			 * lock on its shared entry.
+			 */
+			if (pg_atomic_read_u32(&entry_ref->shared_entry->agecount) ==
+				entry_ref->agecount)
+			{
+				/* Same generation, so we're OK with the removal */
+				Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0);
+				Assert(entry_ref->shared_entry == shent);
+				pgstat_free_entry(shent, NULL);
+			}
+			else
+			{
+				/*
+				 * Shared stats entry has been reinitialized, so do not drop
+				 * its shared entry, only release its lock.
+				 */
+				dshash_release_lock(pgStatLocal.shared_hash, shent);
+			}
 		}
 	}
 
-- 
2.45.1

#8Kouber Saparev
kouber@gmail.com
In reply to: Michael Paquier (#7)
Re: BF mamba failure

Am I correct to believe this patch is fixing the "can only drop stats once" issue?

It just happened to us, one of our streaming replicas decided to shut down.

#9Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Kouber Saparev (#8)
Re: BF mamba failure

Hi,

On Thu, Oct 10, 2024 at 06:19:30AM +0000, Kouber Saparev wrote:

Am I correct to believe this patch is fixing the "can only drop stats once" issue?

It just happened to us, one of our streaming replicas decided to shut down.

Does the error message look like (added in [1]):

"
trying to drop stats entry already dropped: kind=%s dboid=%u objoid=%u refcount=%u",
"

by any chance? If so, would you mind sharing it?

Regards,

[1]: https://postgr.es/c/d92573adcb

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#10Kouber Saparev
kouber@gmail.com
In reply to: Bertrand Drouvot (#9)
Re: BF mamba failure

On Thu, Oct 10, 2024 at 17:42, Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
wrote:

Does the error message look like (added in [1]):

"
trying to drop stats entry already dropped: kind=%s dboid=%u objoid=%u
refcount=%u",
"

by any chance? If so, would you mind sharing it?

Regards,

[1]: https://postgr.es/c/d92573adcb

Unfortunately not, we are running 15.4 and planning to upgrade very soon.
Is the patch mentioned already merged in PostgreSQL 16?

The last lines in the log are:

2024-09-22 07:33:53.913 UTC,,,223214,,64e79dc1.367ee,3255151,,2023-08-24
18:13:21 UTC,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at
45F8F/DB943818 for Transaction/COMMIT: 2024-09-22 07:24:49.603408+00;
dropped stats: 2/16420/3237787065 2/16420/3237787064 2/16420/3237787061;
inval msgs: catcache 32 catcache 7 catcache 6 catcache 7 catcache 6
catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 55 catcache 54 catcache 80 catcache 79 catcache 80
catcache 79 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 55 catcache 54 relcache 3237787065 relcache 3237787064 snapshot
2608 relcache 3237787064 snapshot 2608 snapshot 2608 snapshot 2608 relcache
3237787061 snapshot 2608 snapshot 1214",,,,"","startup",,0
2024-09-22 07:33:54.501 UTC,,,223207,,64e79dc0.367e7,1423,,2023-08-24
18:13:20 UTC,,0,LOG,00000,"startup process (PID 223214) exited with exit
code 1",,,,,,,,,"","postmaster",,0
2024-09-22 07:33:54.501 UTC,,,223207,,64e79dc0.367e7,1424,,2023-08-24
18:13:20 UTC,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,"","postmaster",,0
2024-09-22 07:33:55.398 UTC,,,223207,,64e79dc0.367e7,1425,,2023-08-24
18:13:20 UTC,,0,LOG,00000,"shutting down due to startup process
failure",,,,,,,,,"","postmaster",,0
2024-09-22 07:33:56.697 UTC,,,223207,,64e79dc0.367e7,1426,,2023-08-24
18:13:20 UTC,,0,LOG,00000,"database system is shut
down",,,,,,,,,"","postmaster",,0

--
Kouber

#11Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Kouber Saparev (#10)
Re: BF mamba failure

Hi,

On Fri, Oct 11, 2024 at 11:07:29AM +0300, Kouber Saparev wrote:

На чт, 10.10.2024 г. в 17:42 Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
написа:

Does the error message looks like (added in [1]):

"
trying to drop stats entry already dropped: kind=%s dboid=%u objoid=%u
refcount=%u",
"

by any chance? If so, would you mind to share it?

Regards,

[1]: https://postgr.es/c/d92573adcb

Unfortunately not, we are running 15.4 and planning to upgrade very soon.
Is the patch mentioned already merged in PostgreSQL 16?

Yes, as of 16.4.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#12Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#11)
Re: BF mamba failure

On Fri, Oct 11, 2024 at 08:18:58AM +0000, Bertrand Drouvot wrote:

On Fri, Oct 11, 2024 at 11:07:29AM +0300, Kouber Saparev wrote:

Unfortunately not, we are running 15.4 and planning to upgrade very soon.
Is the patch mentioned already merged in PostgreSQL 16?

Yes, as of 16.4.

Right. I'd surely welcome more eyes on what I have posted here:
/messages/by-id/Zm-8Xo93K9yD9fy7@paquier.xyz

I am a bit annoyed by the fact of making PgStatShared_HashEntry
slightly larger to track the age of the entries, especially as this
comes down to being useful only when an OID wraparound happens, for
all stats kinds (well, except replslot data, because slot IDs can be
reused more aggressively, or custom pgstats), but I don't really have
a brighter idea, because this is something that we need to track at
entry-level to be able to tell across backends when a single stats
dshash entry can or cannot be reused.
--
Michael

#13Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#12)
Re: BF mamba failure

Hi,

On Wed, Oct 16, 2024 at 09:43:48AM +0900, Michael Paquier wrote:

On Fri, Oct 11, 2024 at 08:18:58AM +0000, Bertrand Drouvot wrote:

On Fri, Oct 11, 2024 at 11:07:29AM +0300, Kouber Saparev wrote:

Unfortunately not, we are running 15.4 and planning to upgrade very soon.
Is the patch mentioned already merged in PostgreSQL 16?

Yes, as of 16.4.

Right. I'd surely welcome more eyes on what I have posted here:
/messages/by-id/Zm-8Xo93K9yD9fy7@paquier.xyz

I applied your patch and do confirm that it fixes the issue. While that works, I
wonder whether there is another way to fix it.

Indeed, in pgstat_release_entry_ref(), we're doing:

"
if (pg_atomic_fetch_sub_u32(&entry_ref->shared_entry->refcount, 1) == 1)
.
.
shent = dshash_find(pgStatLocal.shared_hash,
&entry_ref->shared_entry->key,
true);
"

I wonder if we are not decrementing &entry_ref->shared_entry->refcount too early.

I mean, wouldn't it make sense to decrement it after the dshash_find() call?
(to ensure "proper" whole-entry locking, done in dshash_find(), first)

I am a bit annoyed by the fact of making PgStatShared_HashEntry
slightly larger to track the age of the entries,

Yeah, FWIW, we would be going from 32 bytes:

(gdb) ptype /o struct PgStatShared_HashEntry
/* offset | size */ type = struct PgStatShared_HashEntry {
/* 0 | 16 */ PgStat_HashKey key;
/* 16 | 1 */ _Bool dropped;
/* XXX 3-byte hole */
/* 20 | 4 */ pg_atomic_uint32 refcount;
/* 24 | 8 */ dsa_pointer body;

/* total size (bytes): 32 */
}

to 40 bytes (with the patch applied):

(gdb) ptype /o struct PgStatShared_HashEntry
/* offset | size */ type = struct PgStatShared_HashEntry {
/* 0 | 16 */ PgStat_HashKey key;
/* 16 | 1 */ _Bool dropped;
/* XXX 3-byte hole */
/* 20 | 4 */ pg_atomic_uint32 refcount;
/* 24 | 4 */ pg_atomic_uint32 agecount;
/* XXX 4-byte hole */
/* 32 | 8 */ dsa_pointer body;

/* total size (bytes): 40 */
}

Due to the padding, that's an 8-byte increase while we're adding "only" 4 bytes.
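
As a quick cross-check of those numbers, a stand-alone sketch with stand-in
types (not the real definitions from pgstat_internal.h; assuming a typical
64-bit platform) reproduces the same 32 -> 40 jump:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct KeyModel         /* stands in for the 16-byte PgStat_HashKey */
{
    uint32_t    kind;
    uint32_t    dboid;
    uint64_t    objid;
} KeyModel;

typedef struct EntryBefore      /* layout without the new counter */
{
    KeyModel    key;            /* offset 0 */
    bool        dropped;        /* offset 16, followed by a 3-byte hole */
    uint32_t    refcount;       /* offset 20 */
    uint64_t    body;           /* offset 24, stands in for dsa_pointer */
} EntryBefore;

typedef struct EntryAfter       /* layout with the extra 4-byte counter */
{
    KeyModel    key;
    bool        dropped;
    uint32_t    refcount;
    uint32_t    agecount;       /* offset 24, followed by a 4-byte hole */
    uint64_t    body;           /* offset 32 */
} EntryAfter;

int
main(void)
{
    /* prints "before: 32 bytes, after: 40 bytes" on common LP64 platforms */
    printf("before: %zu bytes, after: %zu bytes\n",
           sizeof(EntryBefore), sizeof(EntryAfter));
    return 0;
}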

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#14Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#13)
Re: BF mamba failure

On Wed, Oct 16, 2024 at 10:11:08AM +0000, Bertrand Drouvot wrote:

Indeed, in pgstat_release_entry_ref(), we're doing:

if (pg_atomic_fetch_sub_u32(&entry_ref->shared_entry->refcount, 1) == 1)
.
.
shent = dshash_find(pgStatLocal.shared_hash,
&entry_ref->shared_entry->key,
true);

I wonder if we are not decrementing &entry_ref->shared_entry->refcount too early.

I mean, wouldn't that make sense to decrement it after the dshash_find() call?
(to ensure a "proper" whole entry locking, done in dshash_find(), first)

Making that a two-step process (first step to read the refcount,
second step to decrement it after taking the exclusive lock through
dshash_find) would make the logic more complicated than what we have
now, without offering more protection, afaik, because you'd still need
to worry about a race condition between the first and second steps.
Making this one-step only would incur more dshash_find() calls than
we actually need, and I would not underestimate that cost under a
heavily-concurrent pgstat_release_entry_ref() path.

Yeah, FWIW, we would be going from 32 bytes:
/* total size (bytes): 32 */

to 40 bytes (with the patch applied):
/* total size (bytes): 40 */

Due to the padding, that's a 8 bytes increase while we're adding "only" 4 bytes.

I have spent some time digging into all the original pgstat threads
dealing with the shmem implementation or dshash, and I'm not really
seeing anybody stressing on the size of the individual hash entries.
This stuff was already wasting space with padding originally, so
perhaps we're stressing too much here? Would anybody like to comment?
--
Michael

#15Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#14)
1 attachment(s)
Re: BF mamba failure

On Thu, Oct 17, 2024 at 01:24:50PM +0900, Michael Paquier wrote:

Yeah, FWIW, we would be going from 32 bytes:
/* total size (bytes): 32 */

to 40 bytes (with the patch applied):
/* total size (bytes): 40 */

Due to the padding, that's a 8 bytes increase while we're adding "only" 4 bytes.

I have spent some time digging into all the original pgstat threads
dealing with the shmem implementation or dshash, and I'm not really
seeing anybody stressing on the size of the individual hash entries.
This stuff was already wasting space with padding originally, so
perhaps we're stressing too much here? Would anybody like to comment?

I've been going through this patch again, and the failures that could
be seen because of such races, like standbys failing in an
unpredictable way, are not cool, so I am planning to apply the attached
down to 15 now that the release has cooled down. At the end I am not
really stressing about this addition in the structure for the sake of
making the stats entries safe to handle.

I did not much like the name "agecount" though, used to cross-check
how many times an entry is reused in shared memory and in the local
copy we keep in a process, so I've renamed it to "generation".
--
Michael

Attachments:

v2-0001-Introduce-generation-counter-for-pgstats-entries.patch (text/x-diff; charset=us-ascii)
From abd8b567f3cd97ca4e0282ce7657f64807f8c008 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 14 Nov 2024 16:53:58 +0900
Subject: [PATCH v2] Introduce "generation" counter for pgstats entries

This fixes a set of race conditions with cumulative statistics where a
shared stats entry could be dropped while it should still be valid
when it gets reused.  This can happen for the following case, causing
the same hash key to be used:
- replication slots that compute internally an index number.
- other object types, with a wraparound involved.
- As of PostgreSQL 18, custom pgstats kinds.

The "generation" is tracked in the shared entries, initialized at 0 when
an entry is created and incremented when the same entry is reused, to
avoid concurrent turbulences on drop because of other backends still
holding a reference to it.
---
 src/include/utils/pgstat_internal.h       | 14 ++++++++
 src/backend/utils/activity/pgstat_shmem.c | 40 ++++++++++++++++++++---
 2 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 61b2e1f96b..45f5901d45 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -94,6 +94,14 @@ typedef struct PgStatShared_HashEntry
 	 */
 	pg_atomic_uint32 refcount;
 
+	/*
+	 * Counter tracking the number of times the entry has been reused.
+	 *
+	 * Set to 0 when the entry is created, and incremented by one each time
+	 * the shared entry is reinitialized with pgstat_reinit_entry().
+	 */
+	pg_atomic_uint32 generation;
+
 	/*
 	 * Pointer to shared stats. The stats entry always starts with
 	 * PgStatShared_Common, embedded in a larger struct containing the
@@ -133,6 +141,12 @@ typedef struct PgStat_EntryRef
 	 */
 	PgStatShared_Common *shared_stats;
 
+	/*
+	 * Copy of PgStatShared_HashEntry->generation, keeping locally track of
+	 * the shared stats entry "generation" retrieved (number of times reused).
+	 */
+	uint32		generation;
+
 	/*
 	 * Pending statistics data that will need to be flushed to shared memory
 	 * stats eventually. Each stats kind utilizing pending data defines what
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index c1b7ff76b1..c8ee0bd0b0 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -304,6 +304,11 @@ pgstat_init_entry(PgStat_Kind kind,
 	 * further if a longer lived reference is needed.
 	 */
 	pg_atomic_init_u32(&shhashent->refcount, 1);
+
+	/*
+	 * Initialize generation to 0, as freshly initialized.
+	 */
+	pg_atomic_init_u32(&shhashent->generation, 0);
 	shhashent->dropped = false;
 
 	chunk = dsa_allocate0(pgStatLocal.dsa, pgstat_get_kind_info(kind)->shared_size);
@@ -327,6 +332,12 @@ pgstat_reinit_entry(PgStat_Kind kind, PgStatShared_HashEntry *shhashent)
 
 	/* mark as not dropped anymore */
 	pg_atomic_fetch_add_u32(&shhashent->refcount, 1);
+
+	/*
+	 * Increment generation, to let any backend with local references know
+	 * that what they point to is outdated.
+	 */
+	pg_atomic_fetch_add_u32(&shhashent->generation, 1);
 	shhashent->dropped = false;
 
 	/* reinitialize content */
@@ -367,6 +378,7 @@ pgstat_acquire_entry_ref(PgStat_EntryRef *entry_ref,
 
 	entry_ref->shared_stats = shheader;
 	entry_ref->shared_entry = shhashent;
+	entry_ref->generation = pg_atomic_read_u32(&shhashent->generation);
 }
 
 /*
@@ -532,7 +544,8 @@ pgstat_get_entry_ref(PgStat_Kind kind, Oid dboid, uint64 objid, bool create,
 			 * case are replication slot stats, where a new slot can be
 			 * created with the same index just after dropping. But oid
 			 * wraparound can lead to other cases as well. We just reset the
-			 * stats to their plain state.
+			 * stats to their plain state, while incrementing its "generation"
+			 * in the shared entry for any remaining local references.
 			 */
 			shheader = pgstat_reinit_entry(kind, shhashent);
 			pgstat_acquire_entry_ref(entry_ref, shhashent, shheader);
@@ -599,10 +612,27 @@ pgstat_release_entry_ref(PgStat_HashKey key, PgStat_EntryRef *entry_ref,
 			if (!shent)
 				elog(ERROR, "could not find just referenced shared stats entry");
 
-			Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0);
-			Assert(entry_ref->shared_entry == shent);
-
-			pgstat_free_entry(shent, NULL);
+			/*
+			 * This entry may have been reinitialized while trying to release
+			 * it, so double-check that it has not been reused while holding a
+			 * lock on its shared entry.
+			 */
+			if (pg_atomic_read_u32(&entry_ref->shared_entry->generation) ==
+				entry_ref->generation)
+			{
+				/* Same generation, so we're OK with the removal */
+				Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0);
+				Assert(entry_ref->shared_entry == shent);
+				pgstat_free_entry(shent, NULL);
+			}
+			else
+			{
+				/*
+				 * Shared stats entry has been reinitialized, so do not drop
+				 * its shared entry, only release its lock.
+				 */
+				dshash_release_lock(pgStatLocal.shared_hash, shent);
+			}
 		}
 	}
 
-- 
2.45.2

#16Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#15)
Re: BF mamba failure

Hi,

On Thu, Nov 14, 2024 at 04:55:23PM +0900, Michael Paquier wrote:

I've been going through this patch again, and the failures that could
be seen because of such races, like standbys failing in an
unpredictable way, are not cool, so I am planning to apply the attached
down to 15 now that the release has cooled down. At the end I am not
really stressing about this addition in the structure for the sake of
making the stats entries safe to handle.

I don't think the addition is a concern either.

I did not like much the name "agecount" though, used to cross-check
how many times an entry is reused in shared memory and in the local
copy we keep in a process, so I've renamed it to "generation".

"generation" sounds more appropriate to me too.

The approach to avoid the error sounds reasonable to me.

Just 2 comments about the patch:

=== 1

Maybe use "generation" instead of generation in the comments (where it's not done,
some comments do it already).

=== 2

We could think about adding a test. That should be doable with replication slots
or custom pgstats kinds and probably injection points. But I'm not sure that's
worth it, as the code in the patch looks "simple" enough. Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#17Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#16)
Re: BF mamba failure

On Thu, Nov 14, 2024 at 10:07:41AM +0000, Bertrand Drouvot wrote:

=== 1

Maybe use "generation" instead of generation in the comments (where it's not done,
some comments do it already).

I've tweaked things to be more consistent, and applied that down to 15.

=== 2

We could think about adding a test. That should be doable with replication slots
or custom pgstats kinds and probably injection points. But I'm not sure that's
worth it, as the code in the patch looks "simple" enough. Thoughts?

The tricky part is that this path involves a backend shutdown. We
should be able to do something with a wait before the dshash_find() in
pgstat_release_entry_ref() with a BackgroundPsql object that gets
stopped, but I'm not sure if we can guarantee a lookup of
pg_stat_activity at this stage. Let me see..
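
As a rough sketch of that idea (the injection point name below is made up,
and this assumes the one-argument INJECTION_POINT() macro; nothing here is
committed), the wait would sit where the reproducer patches put their
pg_usleep():

/* hypothetical: in pgstat_release_entry_ref(), just before dshash_find() */
INJECTION_POINT("pgstat-release-entry-ref-before-drop");

A test could then attach the injection_points module's 'wait' callback to
that point with injection_points_attach(), stop the BackgroundPsql session
so that its shutdown path blocks there, drop and re-create the slot from
another session, and finally call injection_points_wakeup() to let the
blocked backend finish.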
--
Michael

#18Kouber Saparev
kouber@gmail.com
In reply to: Michael Paquier (#17)
Re: BF mamba failure

We just had the same sudden replica shutdown, this time with version 17.3.

2025-08-02 22:10:02.229 UTC,,,473966,,67b3d76c.73b6e,14,,2025-02-18
00:42:20 UTC,8111/0,0,FATAL,XX000,"trying to drop stats entry already
dropped: kind=relation dboid=16420 objoid=4169049057 refcount=1",,,,,"WAL
redo at 6337E/9FF275E8 for Transaction/COMMIT: 2025-08-02
22:10:02.163756+00; dropped stats: 2/16420/4169049057 2/16420/4169049056
2/16420/4169049053 2/16420/4169049071 2/16420/4169049070 2/16420/4169049068
2/16420/4169049064 2/16420/4169049154 2/16420/4169049153 2/16420/4169049152
2/16420/4169049149; inval msgs: catcache 80 c
atcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6
catcache 7 catcache 6 catcache 55 catcache 54 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6
catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54
catcache 7 catcache 6 catcache 7 catcache 6 catcache 32 cat
cache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54
catcache 32 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55
catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 55 catcache 54 catcache 80 catcache 79 catcache 80 catcache 79
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache
54 catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55
catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 ca
tcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache
6 catcache 32 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55
catcache 54 catcache 55 catcache 54 catcache 7 catcache 6 catcache 32
catcache 55 catcache 54 catcache 55 catcache 54 cat
cache 32 catcache 7 catcache 6 catcache 55 catcache 54 catcache 32 catcache
7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 cat
cache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55
catcache 54 catcache 80 catcache 79 catcache 80 catcache 79 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 cat
cache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache
80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 c
atcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catca
che 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 32
catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54
catcache 55 catcache 54 catcache 7 catcache 6 catcache 32 catcache 55
catcache 54 catcache 55 catcache 54 catcache 32 catca
che 7 catcache 6 catcache 55 catcache 54 catcache 32 catcache 7 catcache 6
catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7 catcac
he 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54
catcache 80 catcache 79 catcache 80 catcache 79 catcache 7 catcache 6
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7 catcac
he 6 catcache 7 catcache 6 catcache 55 catcache 54 snapshot 2608 relcache
4169049053 snapshot 1214 relcache 4169049056 relcache 4169049057 relcache
4169049056 snapshot 2608 relcache 4169049056 relcache 4169049057 relcache
4169049053 snapshot 2608 relcache 4169049057 rel
cache 4169049056 snapshot 2608 relcache 4169049056 snapshot 2608 snapshot
2608 snapshot 2608 relcache 4169049053 snapshot 2608 snapshot 1214 snapshot
2608 relcache 4169049064 snapshot 1214 relcache 4169049068 relcache
4169049070 relcache 4169049068 snapshot 2608 relcach
e 4169049068 relcache 4169049070 relcache 4169049064 snapshot 2608 relcache
4169049071 relcache 4169049064 snapshot 2608 relcache 4169049064 relcache
4169049071 relcache 4169049071 relcache 4169049064 snapshot 2608 relcache
4169049070 relcache 4169049068 snapshot 2608 r
elcache 4169049068 snapshot 2608 snapshot 2608 snapshot 2608 relcache
4169049064 snapshot 2608 snapshot 1214 snapshot 2608 relcache 4169049149
snapshot 1214 relcache 4169049152 relcache 4169049153 relcache 4169049152
snapshot 2608 relcache 4169049152 relcache 4169049153
relcache 4169049149 snapshot 2608 relcache 4169049154 relcache 4169049149
snapshot 2608 relcache 4169049149 relcache 4169049154 relcache 4169049154
relcache 4169049149 snapshot 2608 relcache 4169049153 relcache 4169049152
snapshot 2608 relcache 4169049152 snapshot 2608
snapshot 2608 snapshot 2608 relcache 4169049149 snapshot 2608 snapshot
1214",,,,"","startup",,0
2025-08-02 22:10:02.389 UTC,,,473962,,67b3d76b.73b6a,27,,2025-02-18
00:42:19 UTC,,0,LOG,00000,"startup process (PID 473966) exited with exit
code 1",,,,,,,,,"","postmaster",,0
2025-08-02 22:10:02.389 UTC,,,473962,,67b3d76b.73b6a,28,,2025-02-18
00:42:19 UTC,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,"","postmaster",,0
2025-08-02 22:10:02.516 UTC,,,473962,,67b3d76b.73b6a,29,,2025-02-18
00:42:19 UTC,,0,LOG,00000,"shutting down due to startup process
failure",,,,,,,,,"","postmaster",,0
2025-08-02 22:10:02.563 UTC,,,473962,,67b3d76b.73b6a,30,,2025-02-18
00:42:19 UTC,,0,LOG,00000,"database system is shut
down",,,,,,,,,"","postmaster",,0

On Fri, Nov 15, 2024 at 6:45 Michael Paquier <michael@paquier.xyz> wrote:


On Thu, Nov 14, 2024 at 10:07:41AM +0000, Bertrand Drouvot wrote:

=== 1

Maybe use "generation" instead of generation in the comments (where it's

not done,

some comments do it already).

I've tweaked things to be more consistent, and applied that down to 15.

=== 2

We could think about adding a test. That should be doable with
replication slots or custom pgstats kinds and probably injection points.
But I'm not sure that's worth it, as the code in the patch looks "simple"
enough. Thoughts?

The tricky part is that this path involves a backend shutdown. We
should be able to do something with a wait before the dshash_find() in
pgstat_release_entry_ref() with a BackgroundPsql object that gets
stopped, but I'm not sure if we can guarantee a lookup of
pg_stat_activity at this stage. Let me see..
--
Michael

#19Michael Paquier
michael@paquier.xyz
In reply to: Kouber Saparev (#18)
Re: BF mamba failure

On Mon, Aug 04, 2025 at 01:00:35PM +0300, Kouber Saparev wrote:

We just had the same sudden replica shutdown, this time with version 17.3.

2025-08-02 22:10:02.229 UTC,,,473966,,67b3d76c.73b6e,14,,2025-02-18
00:42:20 UTC,8111/0,0,FATAL,XX000,"trying to drop stats entry already
dropped: kind=relation dboid=16420 objoid=4169049057 refcount=1",,,,,"WAL
redo at 6337E/9FF275E8 for Transaction/COMMIT: 2025-08-02

You are close to wraparound for this relation. Is that vanilla
streaming replication with heap?
--
Michael

#20Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#19)
Re: BF mamba failure

Hi,

On Thu, Aug 07, 2025 at 08:18:26AM +0900, Michael Paquier wrote:

On Mon, Aug 04, 2025 at 01:00:35PM +0300, Kouber Saparev wrote:

We just had the same sudden replica shutdown, this time with version 17.3.

2025-08-02 22:10:02.229 UTC,,,473966,,67b3d76c.73b6e,14,,2025-02-18
00:42:20 UTC,8111/0,0,FATAL,XX000,"trying to drop stats entry already
dropped: kind=relation dboid=16420 objoid=4169049057 refcount=1",,,,,"WAL
redo at 6337E/9FF275E8 for Transaction/COMMIT: 2025-08-02

You are close to wraparound for this relation. Is that vanilla
streaming replication with heap?

I was wondering about adding the "generation" (introduced in 818119afccd3) to
the above error log message, thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#21Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#20)
Re: BF mamba failure

On Thu, Aug 07, 2025 at 05:28:17AM +0000, Bertrand Drouvot wrote:

On Thu, Aug 07, 2025 at 08:18:26AM +0900, Michael Paquier wrote:

On Mon, Aug 04, 2025 at 01:00:35PM +0300, Kouber Saparev wrote:
You are close to wraparound for this relation. Is that vanilla
streaming replication with heap?

I was wondering about adding the "generation" (introduced in 818119afccd3) to
the above error log message, thoughts?

Yes, that could prove to be useful when debugging such issues. We do
not have a lot of information here.
--
Michael

#22Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#21)
1 attachment(s)
Re: BF mamba failure

Hi,

On Thu, Aug 07, 2025 at 04:17:29PM +0900, Michael Paquier wrote:

On Thu, Aug 07, 2025 at 05:28:17AM +0000, Bertrand Drouvot wrote:

On Thu, Aug 07, 2025 at 08:18:26AM +0900, Michael Paquier wrote:

On Mon, Aug 04, 2025 at 01:00:35PM +0300, Kouber Saparev wrote:
You are close to wraparound for this relation. Is that vanilla
streaming replication with heap?

I was wondering about adding the "generation" (introduced in 818119afccd3) to
the above error log message, thoughts?

Yes, that could prove to be useful when debugging such issues. We do
not have a lot of information here.

Yes, done in the attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Add-more-debugging-information-when-dropping-twic.patch (text/x-diff; charset=us-ascii)
From db0fb464258992b2596d528b14c5b73909e5d93e Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Thu, 7 Aug 2025 07:58:40 +0000
Subject: [PATCH v1] Add more debugging information when dropping twice pgstats
 entry

Dropping twice a pgstats entry should not happen, still the error generated
is missing the "generation" counter that has been added in 818119afccd3.

This commit adds it as it could be helpful for debugging.
---
 src/backend/utils/activity/pgstat_shmem.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 53e7d534270..cca4277f234 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -874,11 +874,12 @@ pgstat_drop_entry_internal(PgStatShared_HashEntry *shent,
 	 */
 	if (shent->dropped)
 		elog(ERROR,
-			 "trying to drop stats entry already dropped: kind=%s dboid=%u objid=%" PRIu64 " refcount=%u",
+			 "trying to drop stats entry already dropped: kind=%s dboid=%u objid=%" PRIu64 " refcount=%u generation=%u",
 			 pgstat_get_kind_info(shent->key.kind)->name,
 			 shent->key.dboid,
 			 shent->key.objid,
-			 pg_atomic_read_u32(&shent->refcount));
+			 pg_atomic_read_u32(&shent->refcount),
+			 pg_atomic_read_u32(&shent->generation));
 	shent->dropped = true;
 
 	/* release refcount marking entry as not dropped */
-- 
2.34.1

#23Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#22)
Re: BF mamba failure

On Thu, Aug 07, 2025 at 08:35:17AM +0000, Bertrand Drouvot wrote:

On Thu, Aug 07, 2025 at 04:17:29PM +0900, Michael Paquier wrote:

Yes, that could prove to be useful when debugging such issues. We do
not have a lot of information here.

Yes, done in the attached.

Thanks. With the next minor release happening next week, I have done that
now so we will be able to gather more information from the field.
--
Michael

#24Kouber Saparev
kouber@gmail.com
In reply to: Michael Paquier (#19)
Re: BF mamba failure

Indeed, nothing exotic about our replication.

As for the object 4169049057, I am not able to find it anywhere in the
catalogs. Perhaps it was something that was dropped in the meantime.

On Thu, Aug 7, 2025 at 2:18 Michael Paquier <michael@paquier.xyz> wrote:


On Mon, Aug 04, 2025 at 01:00:35PM +0300, Kouber Saparev wrote:

We just had the same sudden replica shutdown, this time with version 17.3.

2025-08-02 22:10:02.229 UTC,,,473966,,67b3d76c.73b6e,14,,2025-02-18
00:42:20 UTC,8111/0,0,FATAL,XX000,"trying to drop stats entry already
dropped: kind=relation dboid=16420 objoid=4169049057 refcount=1",,,,,"WAL
redo at 6337E/9FF275E8 for Transaction/COMMIT: 2025-08-02

You are close to wraparound for this relation. Is that vanilla
streaming replication with heap?
--
Michael

#25Kouber Saparev
kouber@gmail.com
In reply to: Kouber Saparev (#24)
Re: BF mamba failure

Yet again one of our replicas died. Should I file a bug report or
something? What should we do in order to prevent it? Restart the database
every month/week or so?...

2025-09-09 12:08:10.688 UTC,,,1510554,,67b3d8bc.170c9a,337,,2025-02-18
00:47:56 UTC,8111/0,0,FATAL,XX000,"trying to drop stats entry already
dropped: kind=relation dboid=16420 objoid=767325170 refcount=2",,,,,"WAL
redo at 66E4F/74A0F210 for Transaction/COMMIT: 2025-09-09
12:08:10.657171+00; dropped stats: 2/16420/767325174 2/16420/767325173
2/16420/767325170 2/16420/767325192 2/16420/767325191 2/16420/767325189
2/16420/767325186 2/16420/767325198 2/16420/767325197
2/16420/767325196 2/16420/767325193; inval msgs: catcache 80 catcache 79
catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 55 catcache 54 catcache
7 catcache 6 catcache 7 catcache 6 catcache 32 catcache 55 catcache 54
catcache 55 catcache 54 catcache 55 catcache 54 catcache 32 catcache 7
catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55
catcache 54 catcache 80 catcache 79 catcache 80 catcache 79 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 80
catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6
catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54
catcache 7 catcache 6 catcache 7 catcache 6 catcache 32 catcache
55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54
catcache 55 catcache 54 catcache 7 catcache 6 catcache 32 catcache 55
catcache 54 catcache 55 catcache 54 catcache 32 catcache 7 catcache
6 catcache 55 catcache 54 catcache 32 catcache 7 catcache 6 catcache 7
catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 80
catcache 79 catcache 80 catcache 79 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 55 catcache 54 catcache 80 catcache 79 catcache 80
catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6
catcache 7 catcache 6 catcache 32 catcache 55 catcache 54 catcache 55
catcache 54 catcache 55 catcache 54 catcache 55 catcache 54
catcache 7 catcache 6 catcache 32 catcache 55 catcache 54 catcache 55
catcache 54 catcache 32 catcache 7 catcache 6 catcache 55 catcache 54
catcache 32 catcache 7 catcache 6 catcache 7 catcache 6 catcache
55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 55 catcache 54 catcache 80 catcache 79 catcache 80
catcache 79 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache
54 snapshot 2608 relcache 767325170 snapshot 1214 relcache 767325173
relcache 767325174 relcache 767325173 snapshot 2608 relcache
767325173 relcache 767325174 relcache 767325170 snapshot 2608 relcache
767325174 relcache 767325173 snapshot 2608 relcache 767325173 snapshot 2608
snapshot 2608 snapshot 2608 relcache 767325170 snapshot
2608 snapshot 1214 snapshot 2608 relcache 767325186 snapshot 1214 relcache
767325189 relcache 767325191 relcache 767325189 snapshot 2608 relcache
767325189 relcache 767325191 relcache 767325186 snapshot 2608
relcache 767325192 relcache 767325186 snapshot 2608 relcache 767325186
relcache 767325192 relcache 767325192 relcache 767325186 snapshot 2608
relcache 767325191 relcache 767325189 snapshot 2608 relcache
767325189 snapshot 2608 snapshot 2608 snapshot 2608 relcache 767325186
snapshot 2608 snapshot 1214 snapshot 2608 relcache 767325193 snapshot 1214
relcache 767325196 relcache 767325197 relcache 767325196
snapshot 2608 relcache 767325196 relcache 767325197 relcache 767325193
snapshot 2608 relcache 767325198 relcache 767325193 snapshot 2608 relcache
767325193 relcache 767325198 relcache 767325198 relcache
767325193 snapshot 2608 relcache 767325197 relcache 767325196 snapshot 2608
relcache 767325196 snapshot 2608 snapshot 2608 snapshot 2608 relcache
767325193 snapshot 2608 snapshot 1214",,,,"","startup",,0


#26Michael Paquier
michael@paquier.xyz
In reply to: Kouber Saparev (#25)
Re: BF mamba failure

On Tue, Sep 09, 2025 at 04:07:45PM +0300, Kouber Saparev wrote:

Yet again one of our replicas died. Should I file a bug report or
something, what should we do in order to prevent it? Restart the database
every month/week or so?...

I don't think we need another bug report for the same problem. We are
aware that this may be an issue and that it is hard to track; the problem
at this stage is finding room to be able to investigate it.

I may be able to come back to it soon-ishly, looking at how to trigger
any race condition. The difficulty is figuring out how the current code
is able to reach this state, because we clearly have a race condition at
hand in standbys.

As a start, are these failures only in the startup process? Had the
startup process reached a consistent state when the problem happened,
with the replay code being too eager at removing the stats entries, or
had it not reached a consistent state yet? These could be useful hints
for extracting a reproducible test case, looking for common patterns.
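
For reference, whether consistency had been reached should show up in the
standby's log as a "consistent recovery state reached at %X/%X" line before
the FATAL. While the standby is up, something like this shows the recovery
state and the replay progress (just an illustration, not a diagnosis):

  -- run on the cascading standby
  SELECT pg_is_in_recovery();          -- true while the server is in recovery
  SELECT pg_last_wal_replay_lsn();     -- how far WAL replay has progressed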

I'll ask around if I have seen cases like that in the user pool I have
an access to.
--
Michael

#27Kouber Saparev
kouber@gmail.com
In reply to: Michael Paquier (#26)
Re: BF mamba failure

On Thu, 11 Sep 2025 at 2:28, Michael Paquier <michael@paquier.xyz> wrote:

As a start, are these failures only in the startup process? Has the
startup process reached a consistent state when the problem happens
because the replay code is too eager at removing the stats entries?
Has it not reached a consistent state. These could be useful hints to
extract a reproducible test case, looking for common patterns.

The pattern is the same, although I am not 100% sure it is the same replica
each time - it is a cascaded streaming replica, i.e. a replica of another
replica. We first had this in October 2024 with version 15.4, then in August
2025 with 17.3, and now again in September (17.3). The database runs
perfectly fine for months under a heavy production workload (lots of WAL
etc.), and then all of a sudden it shuts down.

Thanks for the feedback, and let me know if I could provide any additional
info.

--
Kouber

#28Michael Paquier
michael@paquier.xyz
In reply to: Kouber Saparev (#27)
Re: BF mamba failure

On Thu, Sep 11, 2025 at 04:35:01PM +0300, Kouber Saparev wrote:

The pattern is the same, although I am not 100% sure that the same replica
is having this - it is a cascaded streaming replica, i.e. a replica of
another replica. Once we had this in October 2024, with version 15.4, then
in August 2025 with 17.3, and now in September again (17.3). The database
is working for month(s) perfectly fine in a heavy production workload (lots
of WALs etc.), and then all of a sudden it shuts down.

The shutdown is caused by the startup process choking on redo, FWIW.

Thanks for the feedback, and let me know if I could provide any additional
info.

Okay, the bit about the cascading standby is a useful piece of
information. Do you have any data about the relation the error message
is choking on, based on its OID? Is it actively used in read-only
workloads, with the relation being looked at on the cascading standby?
Is hot_standby_feedback enabled in the cascading standby? With which
procedure has this cascading standby been created? Does the workload of
the primary involve a high consumption of OIDs for relations, say many
temporary tables?

Another thing that may help is the WAL record history. Are you, for
example, seeing attempts to drop the same pgstats entry twice in the WAL
records? Perhaps the origin of the problem is in this area. A refcount
of 2 is relevant, of course.

I have looked around a bit but nothing has popped up here, so as far
as I know you seem to be the only one impacted by this.

1d6a03ea4146 and dc5f9054186a are in 17.3, so perhaps something is
still off with the drop when applied to cascading standbys. A vital
piece of information may also come from "generation", which would show
up in the error report thanks to bdda6ba30cbe, and that's included in
17.6. A first step would be to update to 17.6 and see how things
go for these cascading setups. If it takes a couple of weeks to get
one report, we have a hunt that may take a few months at least, unless
somebody, me or someone else, is able to track down the race condition
here.
--
Michael

#29Kouber Saparev
kouber@gmail.com
In reply to: Michael Paquier (#28)
Re: BF mamba failure

On Fri, 12 Sep 2025 at 3:37, Michael Paquier <michael@paquier.xyz> wrote:

Okay, the bit about the cascading standby is a useful piece of
information. Do you have some data about the relation reported in the
error message this is choking on based on its OID? Is this actively
used in read-only workloads, with the relation looked at in the
cascading standby?

This objoid=767325170 does not exist in the catalogs, and it is not the one
reported in the previous shutdown either (that was objoid=4169049057). So I
guess it is something quasi-temporary that was dropped in the meantime.

Is hot_standby_feedback enabled in the cascading
standby?

Yes, hot_standby_feedback = on.

With which process has this cascading standby been created?
Does the workload of the primary involve a high consumption of OIDs
for relations, say many temporary tables?

Yes, we have around 150 entries added and deleted per second in pg_class,
and around 800 in pg_attribute. So something is actively creating and
dropping tables all the time.
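
For reference, a rough way to eyeball this kind of churn is to sample the
cumulative counters for the catalogs twice and divide by the interval, e.g.:

  SELECT relname, n_tup_ins, n_tup_del
    FROM pg_stat_sys_tables
   WHERE relname IN ('pg_class', 'pg_attribute');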

Another thing that may help is the WAL record history. Are you for
example seeing attempts to drop twice the same pgstats entry in WAL
records? Perhaps the origin of the problem is in this area. A
refcount of 2 is relevant, of course.

How could we dig into this, i.e. inspect such attempts in the WAL
records?

I have looked a bit around but nothing has popped up here, so as far
as I know you seem to be the only one impacted by that.

1d6a03ea4146 and dc5f9054186a are in 17.3, so perhaps something is
still off with the drop when applied to cascading standbys. A vital
piece of information may also be with "generation", which would show
up in the error report thanks to bdda6ba30cbe, and that's included in
17.6. A first thing would be to update to 17.6 and see how things
go for these cascading setups. If it takes a couple of weeks to have
one report, we have a hunt that may take a few months at least, except
if somebody is able to find out the race condition here, me or someone
else.

Is it enough to upgrade the replicas, or do we need to upgrade the primary
as well?

--
Kouber

#30Michael Paquier
michael@paquier.xyz
In reply to: Kouber Saparev (#29)
Re: BF mamba failure

On Tue, Sep 16, 2025 at 02:45:03PM +0300, Kouber Saparev wrote:

On Fri, 12 Sep 2025 at 3:37, Michael Paquier <michael@paquier.xyz> wrote:

With which process has this cascading standby been created?
Does the workload of the primary involve a high consumption of OIDs
for relations, say many temporary tables?

Yes, we have around 150 entries added and deleted per second in pg_class,
and around 800 in pg_attribute. So something is actively creating and
dropping tables all the time.

The number of relations is what most likely matters; I doubt that
pg_attribute contributes much to the friction. One of the test
scenarios I am running is exactly that: a bunch of CREATE/DROP TABLE
statements running concurrently with backends on a cascading standby
accessing them, while tweaking the next OID in the control file to
force wraparounds. I am failing to re-create your problem, for now at
least.
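
In case anyone else wants to poke at this, the DDL-churn side of such a test
can be approximated with a rough sketch like the one below (table names and
the loop count are arbitrary, this is not the exact script I am using), run
from several sessions in parallel on the primary while backends on the
cascading standby keep accessing the tables; the next OID can be pushed
close to wraparound beforehand with pg_resetwal -o on a stopped cluster:

  -- create and drop each table in separate transactions, so the commit of
  -- the drop carries a "dropped stats" item for the standby to replay
  DO $$
  DECLARE
    i int;
  BEGIN
    FOR i IN 1..1000 LOOP
      EXECUTE format('CREATE TABLE churn_tab_%s (a int)', i);
      COMMIT;  -- only valid when the DO block runs outside an explicit transaction
      EXECUTE format('DROP TABLE churn_tab_%s', i);
      COMMIT;
    END LOOP;
  END
  $$;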

Another thing that may help is the WAL record history. Are you for
example seeing attempts to drop twice the same pgstats entry in WAL
records? Perhaps the origin of the problem is in this area. A
refcount of 2 is relevant, of course.

How could we dig into this, i.e. inspecting such attempts in the WAL
records?

Yeah, with pg_walinspect or pg_waldump, but I doubt that you retain
enough WAL history to be able to point at something specific. One
pattern I am looking for is duplicated drops initiated from the WAL
records when wraparound hits, or at least whether there is a window
small enough that two WAL records are generated which attempt to drop
the same entry. That's just an assumption at this stage.
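
If enough WAL around the reported redo LSN is still available, something
along these lines with pg_walinspect would be a starting point (the LSN
window below is made up around the 66E4F/74A0F210 from your log, so adjust
it as needed; pg_waldump --rmgr=Transaction over the same range gives the
equivalent from the command line):

  CREATE EXTENSION IF NOT EXISTS pg_walinspect;

  -- look at Transaction/COMMIT records around the failing redo location and
  -- check whether the same entry shows up under "dropped stats" more than once
  SELECT start_lsn, xid, record_type, description
    FROM pg_get_wal_records_info('66E4F/74A00000', '66E4F/74B00000')
   WHERE resource_manager = 'Transaction'
     AND description LIKE '%dropped stats%';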

Is it enough to upgrade the replicas or we need to upgrade the primary as
well?

That would be both. There are always reasons to update to the latest
minor version of Postgres. See the release notes for a bunch of them;
that's usually convincing enough, especially with the CVEs addressed. :)
--
Michael