Fix possible overflow of pg_stat DSA's refcnt

Started by Anthonin Bonnefoy over 1 year ago, 4 messages
#1Anthonin Bonnefoy
anthonin.bonnefoy@datadoghq.com
1 attachment(s)

Hi,

During backend initialisation, pgStat DSA is attached using
dsa_attach_in_place with a NULL segment. The NULL segment means that
there's no callback to release the DSA when the process exits.
pgstat_detach_shmem only calls dsa_detach which, as mentioned in the
function's comment, doesn't include releasing and doesn't decrement the
reference count of pgStat DSA.

Thus, every time a backend is created, pgStat DSA's refcnt is incremented
but never decremented when the backend shuts down. It will eventually
overflow and reach 0, triggering the "could not attach to dynamic shared
area" error on all newly created backends. When this state is reached, the
only way to recover is to restart the db to reset the counter.
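
The lifecycle described above can be sketched with a toy model. This is a
simplified, single-process illustration and not PostgreSQL's code: toy_area,
toy_attach, toy_detach and toy_run_backends are hypothetical names, there is
no locking, and a starting count of 1 is an assumption. The increment is done
in unsigned arithmetic so the sketch itself avoids signed-overflow undefined
behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the DSA control block; only the counter is modelled. */
typedef struct toy_area
{
	int32_t		refcnt;
} toy_area;

/*
 * Models attaching: refuse if the counter is 0 (the area is treated as
 * destroyed), otherwise take a reference.
 */
static bool
toy_attach(toy_area *area)
{
	if (area->refcnt == 0)
		return false;		/* "could not attach to dynamic shared area" */

	/*
	 * Increment with wraparound. The addition is unsigned; converting the
	 * result back is implementation-defined and wraps on common compilers.
	 */
	area->refcnt = (int32_t) ((uint32_t) area->refcnt + 1u);
	return true;
}

/* Models a detach that frees local state but leaves the counter alone. */
static void
toy_detach(toy_area *area)
{
	(void) area;
}

/* Runs up to n backend lifetimes (attach, then detach); returns how many attached. */
static int
toy_run_backends(toy_area *area, int n)
{
	int			attached = 0;

	for (int i = 0; i < n; i++)
	{
		if (!toy_attach(area))
			break;
		attached++;
		toy_detach(area);
	}
	return attached;
}
```

Starting from a count of 1, running 1000 simulated backends leaves the counter
at 1001, since nothing ever gives a reference back. Once the counter has
wrapped all the way around to 0, toy_attach refuses every caller.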

The issue can be observed by calling dsa_dump in pgstat_detach_shmem and
checking that refcnt's value keeps increasing as new backends are
created. It is also possible to reach the state where all connections are
refused by editing the refcnt manually with lldb/gdb (the alternative,
creating enough backends to wrap the counter around to 0, works but can
take some time). Setting it to -10 and then opening 10 connections will
eventually generate the "could not attach" error.

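The choice of -10 in the experiment above follows from simple modular
arithmetic. A minimal sketch, assuming a 32-bit counter that is only ever
incremented and that attaching is refused once it reads 0
(attaches_before_refusal is a hypothetical helper, not a PostgreSQL function):

```c
#include <stdint.h>

/*
 * Number of attaches that still succeed before a leaking 32-bit counter
 * reaches 0, given its current value. Computed modulo 2^32.
 */
static uint32_t
attaches_before_refusal(int32_t refcnt)
{
	return 0u - (uint32_t) refcnt;
}
```

With a counter of -10 the helper returns 10, which matches the experiment. With
a counter of 1 it returns 4294967295, which is why waiting for the wraparound
to happen on its own takes a long time.
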
This patch fixes this issue by releasing pgStat DSA with
dsa_release_in_place during pgStat shutdown to correctly decrement the
refcnt.
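
The effect of pairing every attach with a release can be illustrated with a
small model. This is only a sketch under assumptions: toy_area, toy_attach,
toy_release and toy_backend_lifetime are hypothetical names that mimic the
counter behaviour described in this mail, with no locking and no real shared
memory.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the shared control block; only the counter is modelled. */
typedef struct toy_area
{
	int			refcnt;
} toy_area;

/* Take a reference, refusing if the counter is already 0. */
static bool
toy_attach(toy_area *area)
{
	if (area->refcnt == 0)
		return false;
	area->refcnt++;
	return true;
}

/* Give the reference back; this is the step the patch adds at shutdown. */
static void
toy_release(toy_area *area)
{
	area->refcnt--;
}

/* One backend lifetime: attach at start, release at shutdown. */
static bool
toy_backend_lifetime(toy_area *area)
{
	if (!toy_attach(area))
		return false;
	/* ... the backend would do its work here ... */
	toy_release(area);
	return true;
}
```

With the release in place, the counter returns to its starting value after
each simulated backend, however many of them are run.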

Regards,
Anthonin

Attachments:

v1-0001-Fix-possible-overflow-of-pg_stat-DSA-s-refcnt.patch (application/octet-stream)
From d18e72af178a25cef1d4095de6eb13684e8c987f Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Date: Tue, 25 Jun 2024 15:54:55 +0200
Subject: Fix possible overflow of pg_stat DSA's refcnt

During backend initialisation, pgStat DSA is attached using
dsa_attach_in_place with a NULL segment. The NULL segment means that
there's no callback to release the DSA when the process exits.

pgstat_detach_shmem only calls dsa_detach which, as mentioned in the
function's comment, doesn't include releasing and thus doesn't decrement
the reference count of pgStat DSA.

Thus, every time a backend is created (new connection or new parallel
worker), pgStat DSA's refcnt is increased and never decreased. It will
overflow and eventually reach 0, triggering the "could not
attach to dynamic shared area" error on all newly created backends. When
this state is reached, the only way to recover is to restart the
instance to reset the counter.

This patch fixes the issue by releasing pgStat DSA with
dsa_release_in_place during pgStat shutdown to decrement refcnt.
---
 src/backend/utils/activity/pgstat_shmem.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index fb79c5e771..51008e9998 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -246,6 +246,13 @@ pgstat_detach_shmem(void)
 	pgStatLocal.shared_hash = NULL;
 
 	dsa_detach(pgStatLocal.dsa);
+
+	/*
+	 * Detach doesn't decrement dsa's refcnt and since no segment was provided
+	 * when attaching to the dsa, no cleanup callbacks are registered. We need
+	 * to manually release dsa to correctly decrement dsa's refcnt
+	 */
+	dsa_release_in_place(pgStatLocal.shmem->raw_dsa_area);
 	pgStatLocal.dsa = NULL;
 }
 
-- 
2.39.3 (Apple Git-146)

#2Michael Paquier
michael@paquier.xyz
In reply to: Anthonin Bonnefoy (#1)
Re: Fix possible overflow of pg_stat DSA's refcnt

On Tue, Jun 25, 2024 at 05:01:55PM +0200, Anthonin Bonnefoy wrote:

> During backend initialisation, pgStat DSA is attached using
> dsa_attach_in_place with a NULL segment. The NULL segment means that
> there's no callback to release the DSA when the process exits.
> pgstat_detach_shmem only calls dsa_detach which, as mentioned in the
> function's comment, doesn't include releasing and doesn't decrement the
> reference count of pgStat DSA.
>
> Thus, every time a backend is created, pgStat DSA's refcnt is incremented
> but never decremented when the backend shuts down. It will eventually
> overflow and reach 0, triggering the "could not attach to dynamic shared
> area" error on all newly created backends. When this state is reached, the
> only way to recover is to restart the db to reset the counter.

Very good catch! It looks like you have seen that in the field, then.
Sad face.

> This patch fixes this issue by releasing pgStat DSA with
> dsa_release_in_place during pgStat shutdown to correctly decrement the
> refcnt.

Sounds logical to me to do that in the pgstat shutdown callback, ordered
with the dsa_detach calls in a single location rather than registering
a different callback to do the same job. Will fix and backpatch,
thanks for the report!
--
Michael

#3Anthonin Bonnefoy
anthonin.bonnefoy@datadoghq.com
In reply to: Michael Paquier (#2)
Re: Fix possible overflow of pg_stat DSA's refcnt

On Wed, Jun 26, 2024 at 7:40 AM Michael Paquier <michael@paquier.xyz> wrote:

> Very good catch! It looks like you have seen that in the field, then.
> Sad face.

Yeah, this happened last week on one of our replicas (version 15.5)
that had 134 days of uptime. We are doing a lot of parallel
queries on this cluster so the combination of high uptime plus
parallel worker creation eventually triggered the issue.
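
As a rough back-of-the-envelope check of that timeline, the average rate of
leaked attaches can be estimated. This is a sketch under assumptions that are
not stated in the thread: the counter is 32 bits wide, it started at 1 when
the instance started, and it had to advance by 2^32 - 1 to wrap around to 0.
leaked_attaches_per_second is a hypothetical helper.

```c
/*
 * Average number of leaked attaches per second needed to wrap a 32-bit
 * counter from 1 around to 0 within the given uptime.
 */
static double
leaked_attaches_per_second(double uptime_days)
{
	const double attaches = 4294967295.0;	/* 2^32 - 1, assumed distance to 0 */
	const double seconds_per_day = 86400.0;

	return attaches / (uptime_days * seconds_per_day);
}
```

For 134 days this works out to roughly 371 new processes per second on
average, which is only plausible on a cluster that starts parallel workers
very frequently.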

> Will fix and backpatch, thanks for the report!

Thanks for handling this and for the quick answer!

Regards,
Anthonin

#4Michael Paquier
michael@paquier.xyz
In reply to: Anthonin Bonnefoy (#3)
Re: Fix possible overflow of pg_stat DSA's refcnt

On Wed, Jun 26, 2024 at 08:48:06AM +0200, Anthonin Bonnefoy wrote:

> Yeah, this happened last week on one of our replicas (version 15.5)
> that had 134 days of uptime. We are doing a lot of parallel
> queries on this cluster so the combination of high uptime plus
> parallel worker creation eventually triggered the issue.

It is not surprising that it would take this much time to detect
it. I've applied the patch down to 15. Thanks a lot
for the analysis and the patch!
--
Michael