pgsql: Teach DSM registry to ERROR if attaching to an uninitialized ent
Teach DSM registry to ERROR if attaching to an uninitialized entry.
If DSM entry initialization fails, backends could try to use an
uninitialized DSM segment, DSA, or dshash table (since the entry is
still added to the registry). To fix, keep track of whether
initialization completed, and ERROR if a backend tries to attach to
an uninitialized entry. We could instead retry initialization as
needed, but that seemed complicated, error prone, and unlikely to
help most cases. Furthermore, such problems probably indicate a
coding error.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: /messages/by-id/dd36d384-55df-4fc2-825c-5bc56c950fa9@gmail.com
Backpatch-through: 17
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/1165a933aab1355757a43cfd9193b6cce06f573b
Modified Files
--------------
src/backend/storage/ipc/dsm_registry.c | 34 ++++++++++++++++++++++++++++++----
1 file changed, 30 insertions(+), 4 deletions(-)
On Wed, Nov 12, 2025 at 3:31 PM Nathan Bossart <nathan@postgresql.org> wrote:
Teach DSM registry to ERROR if attaching to an uninitialized entry.
If DSM entry initialization fails, backends could try to use an
uninitialized DSM segment, DSA, or dshash table (since the entry is
still added to the registry). To fix, keep track of whether
initialization completed, and ERROR if a backend tries to attach to
an uninitialized entry. We could instead retry initialization as
needed, but that seemed complicated, error prone, and unlikely to
help most cases. Furthermore, such problems probably indicate a
coding error.
Having read the thread that led to this commit, I agree that this is
an improvement, but it still doesn't seem like a great situation to
me. Maybe I'm misunderstanding, but it seems like once the
initialization has failed, there's no way to retry: you can't retry
the initialization on the existing segment, and you can't delete it
and create a new segment. If so, that means your only option for
restoring proper function after an initialization failure is to bounce
the entire server.
Now, perhaps the chances of an initialization failure in practice are
quite low. Maybe a typical initialization function won't do anything
that could fail. But it just seems weird to me to design something
that thinks errors are likely enough that it's worth having some
amount of mechanism to deal with them, but unlikely enough that it's
not worth making that mechanism more robust.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 20, 2025 at 10:44:31AM -0500, Robert Haas wrote:
Now, perhaps the chances of an initialization failure in practice are
quite low. Maybe a typical initialization function won't do anything
that could fail. But it just seems weird to me to design something
that thinks errors are likely enough that it's worth having some
amount of mechanism to deal with them, but unlikely enough that it's
not worth making that mechanism more robust.
Well, I will attempt to convince you that is indeed the case, although I'm
not sure I will succeed...
ISTM that besides pure coding errors, the most likely reasons for an ERROR
during DSM segment initialization are things like out-of-memory, too many
LWLock tranches registered, etc. In those cases, do we want every single
backend that tries to attach to the segment to retry and most likely fail
again? And what if the initialization callback completed halfway before
failing? Do we expect every extension author to carefully craft their
callbacks to handle that?
On the other hand, if we don't handle ERRORs and someone does run into an
OOM or whatnot, do we want other backends to attach and use the unitialized
segment (leading to things like seg-faults)? Tracking whether an entry was
initialized seems like cheap insurance against these problems.
Finally, the only report from the field about any of this involves SQLsmith
allocating tons of LWLock tranches via test_dsa, which has been fixed
independently. So I have no evidence to suggest that we need to make it
more robust. In fact, without this one arguably contrived report, I
probably wouldn't have done anything at all.
I'll grant you that this is a judgment call. I tried to find a good
middle ground, but I would be unsurprised to learn that other committers
would have done things differently.
--
nathan
On Thu, Nov 20, 2025 at 11:12 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
ISTM that besides pure coding errors, the most likely reasons for an ERROR
during DSM segment initialization are things like out-of-memory, too many
LWLock tranches registered, etc. In those cases, do we want every single
backend that tries to attach to the segment to retry and most likely fail
again? And what if the initialization callback completed halfway before
failing? Do we expect every extension author to carefully craft their
callbacks to handle that?
Well, I don't want to pretend like I know all the answers here. I do
think it's pretty clear that making it the initialization callback's
job to deal with a partially-failed initialization would be crazy.
What I'd be inclined to do after a failure is tear down the entire
segment. Even if there's no mechanism to retry, you're saving the
memory that the segment would have consumed. And maybe there is a
mechanism to retry. If these initialization callbacks are cheap
enough, maybe having every backend try is not really a big deal --
maybe it's good, insofar as an interactive session would be more
likely to see an error indicating the real cause of failure than a
fake error that basically tells you at some point in the past
something went wrong. If repeated failures are too painful, another
idea could be to just mark the entry as being in a failed state (maybe
with a timestamp) and let the extension decide when it wants to press
the reset button. Or maybe that's all too complicated for too little
utility. I'm not sure. My perception is that this is setting a
significantly lower standard for error tolerance than what we
typically seek to achieve, but sometimes my perceptions are wrong, and
sometimes better options are hard to come by.
On the other hand, if we don't handle ERRORs and someone does run into an
OOM or whatnot, do we want other backends to attach and use the unitialized
segment (leading to things like seg-faults)? Tracking whether an entry was
initialized seems like cheap insurance against these problems.
That's fair.
Finally, the only report from the field about any of this involves SQLsmith
allocating tons of LWLock tranches via test_dsa, which has been fixed
independently. So I have no evidence to suggest that we need to make it
more robust. In fact, without this one arguably contrived report, I
probably wouldn't have done anything at all.
Yeah, I'm not clear who is using this or how, so that makes it hard to judge.
I'll grant you that this is a judgment call. I tried to find a good
middle ground, but I would be unsurprised to learn that other committers
would have done things differently.
Such is life!
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 20, 2025 at 01:18:56PM -0500, Robert Haas wrote:
What I'd be inclined to do after a failure is tear down the entire
segment. Even if there's no mechanism to retry, you're saving the
memory that the segment would have consumed.
Unpinning/detaching the segment/DSA/dshash table and deleting the DSM
registry entry in a PG_CATCH block scares me a little, but it might be
doable.
Or maybe that's all too complicated for too little utility. I'm not sure.
That's ultimately where I landed.
My perception is that this is setting a significantly lower standard for
error tolerance than what we typically seek to achieve, but sometimes my
perceptions are wrong, and sometimes better options are hard to come by.
Another thing that might be subconsciously guiding my decisions here is the
existing behavior when a shmem request/startup hook ERRORs (server startup
fails). I'd expect DSM registry users to be doing similar things in their
initialization callbacks, and AFAIK this behavior hasn't been a source of
complaints.
--
nathan
On Thu, Nov 20, 2025 at 6:09 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
Unpinning/detaching the segment/DSA/dshash table and deleting the DSM
registry entry in a PG_CATCH block scares me a little, but it might be
doable.
It seems a bit weird to be doing explicit unpinning in a PG_CATCH
block. Ideally you'd want to postpone the pinning until initialization
has succeeded, so that if you fail before that, transaction cleanup
takes care of it automatically. Alternatively, destroying what existed
before could be deferred until later, when an as-yet-unfailed
transaction stumbles across the tombstone.
Another thing that might be subconsciously guiding my decisions here is the
existing behavior when a shmem request/startup hook ERRORs (server startup
fails). I'd expect DSM registry users to be doing similar things in their
initialization callbacks, and AFAIK this behavior hasn't been a source of
complaints.
What bugs me here is the fact that you have a perfectly good working
server that you can't "salvage". If the server had failed at startup,
that would have sucked, but you would have been forced into correcting
the problem (or giving up on the extension) and restarting, so once
the server gets up and running, it's always in a good state. With this
mechanism, you can get a running server that's stuck in this
failed-initialization state, and there's no way for the DBA to do
anything except by bouncing the entire server. That seems like it
could be really frustrating. Now, if it's the case that resetting this
mechanism wouldn't fundamentally fix anything because the
reinitialization attempt would inevitably fail for the same reason,
then I suppose it's not so bad, but I'm not sure that would always be
true. I just feel like one-way state transitions from good->bad are
undesirable. One way of viewing log levels like ERROR, FATAL, and
PANIC is that they force you to do a reset of the transaction,
session, or entire server to get back to a good state, but here you
just get stuck in the bad one.
Am I worrying too much? Possibly! But as I said to David on another
thread this morning, it's better to worry on pgsql-hackers before any
problem happens than to start worrying after something bad happens in
a customer situation.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Sat, Nov 22, 2025 at 10:25:48AM -0500, Robert Haas wrote:
On Thu, Nov 20, 2025 at 6:09 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
Unpinning/detaching the segment/DSA/dshash table and deleting the DSM
registry entry in a PG_CATCH block scares me a little, but it might be
doable.It seems a bit weird to be doing explicit unpinning in a PG_CATCH
block. Ideally you'd want to postpone the pinning until initialization
has succeeded, so that if you fail before that, transaction cleanup
takes care of it automatically. Alternatively, destroying what existed
before could be deferred until later, when an as-yet-unfailed
transaction stumbles across the tombstone.
Oh, yeah, good idea.
Am I worrying too much? Possibly! But as I said to David on another
thread this morning, it's better to worry on pgsql-hackers before any
problem happens than to start worrying after something bad happens in
a customer situation.
I'll give what you suggested a try. It seems reasonable enough.
--
nathan
On Sat, Nov 22, 2025 at 01:41:21PM -0600, Nathan Bossart wrote:
I'll give what you suggested a try. It seems reasonable enough.
Here is a mostly-untested first try.
--
nathan
Attachments:
v1-0001-Revert-Teach-DSM-registry-to-ERROR-if-attaching-t.patchtext/plain; charset=us-asciiDownload
From e639d70a10f6288034ebe433a22b9dbc53562b01 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Nov 2025 11:03:52 -0600
Subject: [PATCH v1 1/2] Revert "Teach DSM registry to ERROR if attaching to an
uninitialized entry."
This reverts commit 1165a933aab1355757a43cfd9193b6cce06f573b.
---
src/backend/storage/ipc/dsm_registry.c | 34 +++-----------------------
1 file changed, 4 insertions(+), 30 deletions(-)
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c
index 7eba8a8cffb..a926b9c3f32 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -93,7 +93,6 @@ typedef struct DSMRegistryEntry
{
char name[NAMEDATALEN];
DSMREntryType type;
- bool initialized;
union
{
NamedDSMState dsm;
@@ -217,7 +216,6 @@ GetNamedDSMSegment(const char *name, size_t size,
dsm_segment *seg;
entry->type = DSMR_ENTRY_TYPE_DSM;
- entry->initialized = false;
/* Initialize the segment. */
seg = dsm_create(size, 0);
@@ -230,21 +228,13 @@ GetNamedDSMSegment(const char *name, size_t size,
if (init_callback)
(*init_callback) (ret);
-
- entry->initialized = true;
}
else if (entry->type != DSMR_ENTRY_TYPE_DSM)
ereport(ERROR,
- (errmsg("requested DSM segment \"%s\" does not match type of existing entry",
- name)));
- else if (!entry->initialized)
- ereport(ERROR,
- (errmsg("requested DSM segment \"%s\" failed initialization",
- name)));
+ (errmsg("requested DSM segment does not match type of existing entry")));
else if (entry->dsm.size != size)
ereport(ERROR,
- (errmsg("requested DSM segment \"%s\" does not match size of existing entry",
- name)));
+ (errmsg("requested DSM segment size does not match size of existing segment")));
else
{
NamedDSMState *state = &entry->dsm;
@@ -307,7 +297,6 @@ GetNamedDSA(const char *name, bool *found)
NamedDSAState *state = &entry->dsa;
entry->type = DSMR_ENTRY_TYPE_DSA;
- entry->initialized = false;
/* Initialize the LWLock tranche for the DSA. */
state->tranche = LWLockNewTrancheId(name);
@@ -319,17 +308,10 @@ GetNamedDSA(const char *name, bool *found)
/* Store handle for other backends to use. */
state->handle = dsa_get_handle(ret);
-
- entry->initialized = true;
}
else if (entry->type != DSMR_ENTRY_TYPE_DSA)
ereport(ERROR,
- (errmsg("requested DSA \"%s\" does not match type of existing entry",
- name)));
- else if (!entry->initialized)
- ereport(ERROR,
- (errmsg("requested DSA \"%s\" failed initialization",
- name)));
+ (errmsg("requested DSA does not match type of existing entry")));
else
{
NamedDSAState *state = &entry->dsa;
@@ -390,7 +372,6 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
dsa_area *dsa;
entry->type = DSMR_ENTRY_TYPE_DSH;
- entry->initialized = false;
/* Initialize the LWLock tranche for the hash table. */
dsh_state->tranche = LWLockNewTrancheId(name);
@@ -408,17 +389,10 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
/* Store handles for other backends to use. */
dsh_state->dsa_handle = dsa_get_handle(dsa);
dsh_state->dsh_handle = dshash_get_hash_table_handle(ret);
-
- entry->initialized = true;
}
else if (entry->type != DSMR_ENTRY_TYPE_DSH)
ereport(ERROR,
- (errmsg("requested DSHash \"%s\" does not match type of existing entry",
- name)));
- else if (!entry->initialized)
- ereport(ERROR,
- (errmsg("requested DSHash \"%s\" failed initialization",
- name)));
+ (errmsg("requested DSHash does not match type of existing entry")));
else
{
NamedDSHState *dsh_state = &entry->dsh;
--
2.39.5 (Apple Git-154)
v1-0002-handle-ERRORs-in-DSM-registry-functions.patchtext/plain; charset=us-asciiDownload
From 6da47303c5e5acaca4cc9808316a0457f6c6f0af Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Nov 2025 12:28:23 -0600
Subject: [PATCH v1 2/2] handle ERRORs in DSM registry functions
---
src/backend/storage/ipc/dsm_registry.c | 129 ++++++++++++++++++++++---
1 file changed, 115 insertions(+), 14 deletions(-)
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c
index a926b9c3f32..7d1b584d166 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -154,10 +154,24 @@ init_dsm_registry(void)
if (DSMRegistryCtx->dshh == DSHASH_HANDLE_INVALID)
{
/* Initialize dynamic shared hash table for registry. */
- dsm_registry_dsa = dsa_create(LWTRANCHE_DSM_REGISTRY_DSA);
+ PG_TRY();
+ {
+ dsm_registry_dsa = dsa_create(LWTRANCHE_DSM_REGISTRY_DSA);
+ dsm_registry_table = dshash_create(dsm_registry_dsa, &dsh_params, NULL);
+ }
+ PG_CATCH();
+ {
+ if (dsm_registry_dsa)
+ {
+ dsa_detach(dsm_registry_dsa);
+ dsm_registry_dsa = NULL;
+ }
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
dsa_pin(dsm_registry_dsa);
dsa_pin_mapping(dsm_registry_dsa);
- dsm_registry_table = dshash_create(dsm_registry_dsa, &dsh_params, NULL);
/* Store handles in shared memory for other backends to use. */
DSMRegistryCtx->dsah = dsa_get_handle(dsm_registry_dsa);
@@ -216,18 +230,38 @@ GetNamedDSMSegment(const char *name, size_t size,
dsm_segment *seg;
entry->type = DSMR_ENTRY_TYPE_DSM;
+ state->handle = DSM_HANDLE_INVALID;
/* Initialize the segment. */
- seg = dsm_create(size, 0);
+ PG_TRY();
+ {
+ seg = dsm_create(size, 0);
+ }
+ PG_CATCH();
+ {
+ dshash_delete_entry(dsm_registry_table, entry);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ PG_TRY();
+ {
+ ret = dsm_segment_address(seg);
+ if (init_callback)
+ (*init_callback) (ret);
+ }
+ PG_CATCH();
+ {
+ dsm_detach(seg);
+ dshash_delete_entry(dsm_registry_table, entry);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
dsm_pin_segment(seg);
dsm_pin_mapping(seg);
state->handle = dsm_segment_handle(seg);
state->size = size;
- ret = dsm_segment_address(seg);
-
- if (init_callback)
- (*init_callback) (ret);
}
else if (entry->type != DSMR_ENTRY_TYPE_DSM)
ereport(ERROR,
@@ -297,9 +331,19 @@ GetNamedDSA(const char *name, bool *found)
NamedDSAState *state = &entry->dsa;
entry->type = DSMR_ENTRY_TYPE_DSA;
+ state->handle = DSA_HANDLE_INVALID;
/* Initialize the LWLock tranche for the DSA. */
- state->tranche = LWLockNewTrancheId(name);
+ PG_TRY();
+ {
+ state->tranche = LWLockNewTrancheId(name);
+ }
+ PG_CATCH();
+ {
+ dshash_delete_entry(dsm_registry_table, entry);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
/* Initialize the DSA. */
ret = dsa_create(state->tranche);
@@ -312,6 +356,16 @@ GetNamedDSA(const char *name, bool *found)
else if (entry->type != DSMR_ENTRY_TYPE_DSA)
ereport(ERROR,
(errmsg("requested DSA does not match type of existing entry")));
+ else if (entry->dsa.handle == DSA_HANDLE_INVALID)
+ {
+ NamedDSAState *state = &entry->dsa;
+
+ ret = dsa_create(state->tranche);
+
+ dsa_pin(ret);
+ dsa_pin_mapping(ret);
+ state->handle = dsa_get_handle(ret);
+ }
else
{
NamedDSAState *state = &entry->dsa;
@@ -372,19 +426,40 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
dsa_area *dsa;
entry->type = DSMR_ENTRY_TYPE_DSH;
+ dsh_state->dsa_handle = DSA_HANDLE_INVALID;
+ dsh_state->dsh_handle = DSHASH_HANDLE_INVALID;
/* Initialize the LWLock tranche for the hash table. */
- dsh_state->tranche = LWLockNewTrancheId(name);
+ PG_TRY();
+ {
+ dsh_state->tranche = LWLockNewTrancheId(name);
+ }
+ PG_CATCH();
+ {
+ dshash_delete_entry(dsm_registry_table, entry);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
/* Initialize the DSA for the hash table. */
dsa = dsa_create(dsh_state->tranche);
- dsa_pin(dsa);
- dsa_pin_mapping(dsa);
/* Initialize the dshash table. */
- memcpy(¶ms_copy, params, sizeof(dshash_parameters));
- params_copy.tranche_id = dsh_state->tranche;
- ret = dshash_create(dsa, ¶ms_copy, NULL);
+ PG_TRY();
+ {
+ memcpy(¶ms_copy, params, sizeof(dshash_parameters));
+ params_copy.tranche_id = dsh_state->tranche;
+ ret = dshash_create(dsa, ¶ms_copy, NULL);
+ }
+ PG_CATCH();
+ {
+ dsa_detach(dsa);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ dsa_pin(dsa);
+ dsa_pin_mapping(dsa);
/* Store handles for other backends to use. */
dsh_state->dsa_handle = dsa_get_handle(dsa);
@@ -393,6 +468,32 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
else if (entry->type != DSMR_ENTRY_TYPE_DSH)
ereport(ERROR,
(errmsg("requested DSHash does not match type of existing entry")));
+ else if (entry->dsh.dsa_handle == DSA_HANDLE_INVALID)
+ {
+ NamedDSHState *dsh_state = &entry->dsh;
+ dshash_parameters params_copy;
+ dsa_area *dsa;
+
+ dsa = dsa_create(dsh_state->tranche);
+
+ PG_TRY();
+ {
+ memcpy(¶ms_copy, params, sizeof(dshash_parameters));
+ params_copy.tranche_id = dsh_state->tranche;
+ ret = dshash_create(dsa, ¶ms_copy, NULL);
+ }
+ PG_CATCH();
+ {
+ dsa_detach(dsa);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ dsa_pin(dsa);
+ dsa_pin_mapping(dsa);
+ dsh_state->dsa_handle = dsa_get_handle(dsa);
+ dsh_state->dsh_handle = dshash_get_hash_table_handle(ret);
+ }
else
{
NamedDSHState *dsh_state = &entry->dsh;
--
2.39.5 (Apple Git-154)
On Mon, Nov 24, 2025 at 12:36:25PM -0600, Nathan Bossart wrote:
On Sat, Nov 22, 2025 at 01:41:21PM -0600, Nathan Bossart wrote:
I'll give what you suggested a try. It seems reasonable enough.
Here is a mostly-untested first try.
I think 0002 was more complicated than necessary. Here's a second try.
--
nathan
Attachments:
v2-0001-Revert-Teach-DSM-registry-to-ERROR-if-attaching-t.patchtext/plain; charset=us-asciiDownload
From 55956903a8339c754ba96ff0f52ddb38f7ae184b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Nov 2025 11:03:52 -0600
Subject: [PATCH v2 1/2] Revert "Teach DSM registry to ERROR if attaching to an
uninitialized entry."
This reverts commit 1165a933aab1355757a43cfd9193b6cce06f573b.
---
src/backend/storage/ipc/dsm_registry.c | 34 +++-----------------------
1 file changed, 4 insertions(+), 30 deletions(-)
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c
index 7eba8a8cffb..a926b9c3f32 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -93,7 +93,6 @@ typedef struct DSMRegistryEntry
{
char name[NAMEDATALEN];
DSMREntryType type;
- bool initialized;
union
{
NamedDSMState dsm;
@@ -217,7 +216,6 @@ GetNamedDSMSegment(const char *name, size_t size,
dsm_segment *seg;
entry->type = DSMR_ENTRY_TYPE_DSM;
- entry->initialized = false;
/* Initialize the segment. */
seg = dsm_create(size, 0);
@@ -230,21 +228,13 @@ GetNamedDSMSegment(const char *name, size_t size,
if (init_callback)
(*init_callback) (ret);
-
- entry->initialized = true;
}
else if (entry->type != DSMR_ENTRY_TYPE_DSM)
ereport(ERROR,
- (errmsg("requested DSM segment \"%s\" does not match type of existing entry",
- name)));
- else if (!entry->initialized)
- ereport(ERROR,
- (errmsg("requested DSM segment \"%s\" failed initialization",
- name)));
+ (errmsg("requested DSM segment does not match type of existing entry")));
else if (entry->dsm.size != size)
ereport(ERROR,
- (errmsg("requested DSM segment \"%s\" does not match size of existing entry",
- name)));
+ (errmsg("requested DSM segment size does not match size of existing segment")));
else
{
NamedDSMState *state = &entry->dsm;
@@ -307,7 +297,6 @@ GetNamedDSA(const char *name, bool *found)
NamedDSAState *state = &entry->dsa;
entry->type = DSMR_ENTRY_TYPE_DSA;
- entry->initialized = false;
/* Initialize the LWLock tranche for the DSA. */
state->tranche = LWLockNewTrancheId(name);
@@ -319,17 +308,10 @@ GetNamedDSA(const char *name, bool *found)
/* Store handle for other backends to use. */
state->handle = dsa_get_handle(ret);
-
- entry->initialized = true;
}
else if (entry->type != DSMR_ENTRY_TYPE_DSA)
ereport(ERROR,
- (errmsg("requested DSA \"%s\" does not match type of existing entry",
- name)));
- else if (!entry->initialized)
- ereport(ERROR,
- (errmsg("requested DSA \"%s\" failed initialization",
- name)));
+ (errmsg("requested DSA does not match type of existing entry")));
else
{
NamedDSAState *state = &entry->dsa;
@@ -390,7 +372,6 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
dsa_area *dsa;
entry->type = DSMR_ENTRY_TYPE_DSH;
- entry->initialized = false;
/* Initialize the LWLock tranche for the hash table. */
dsh_state->tranche = LWLockNewTrancheId(name);
@@ -408,17 +389,10 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
/* Store handles for other backends to use. */
dsh_state->dsa_handle = dsa_get_handle(dsa);
dsh_state->dsh_handle = dshash_get_hash_table_handle(ret);
-
- entry->initialized = true;
}
else if (entry->type != DSMR_ENTRY_TYPE_DSH)
ereport(ERROR,
- (errmsg("requested DSHash \"%s\" does not match type of existing entry",
- name)));
- else if (!entry->initialized)
- ereport(ERROR,
- (errmsg("requested DSHash \"%s\" failed initialization",
- name)));
+ (errmsg("requested DSHash does not match type of existing entry")));
else
{
NamedDSHState *dsh_state = &entry->dsh;
--
2.39.5 (Apple Git-154)
v2-0002-handle-ERRORs-in-DSM-registry-functions.patchtext/plain; charset=us-asciiDownload
From 3529f2e17b85537198d8a6fa783ca9c0b0522547 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Nov 2025 12:28:23 -0600
Subject: [PATCH v2 2/2] handle ERRORs in DSM registry functions
---
src/backend/storage/ipc/dsm_registry.c | 85 +++++++++++++++++++++++---
1 file changed, 75 insertions(+), 10 deletions(-)
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c
index a926b9c3f32..bae67d15add 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -155,9 +155,10 @@ init_dsm_registry(void)
{
/* Initialize dynamic shared hash table for registry. */
dsm_registry_dsa = dsa_create(LWTRANCHE_DSM_REGISTRY_DSA);
+ dsm_registry_table = dshash_create(dsm_registry_dsa, &dsh_params, NULL);
+
dsa_pin(dsm_registry_dsa);
dsa_pin_mapping(dsm_registry_dsa);
- dsm_registry_table = dshash_create(dsm_registry_dsa, &dsh_params, NULL);
/* Store handles in shared memory for other backends to use. */
DSMRegistryCtx->dsah = dsa_get_handle(dsm_registry_dsa);
@@ -218,16 +219,24 @@ GetNamedDSMSegment(const char *name, size_t size,
entry->type = DSMR_ENTRY_TYPE_DSM;
/* Initialize the segment. */
- seg = dsm_create(size, 0);
+ PG_TRY();
+ {
+ seg = dsm_create(size, 0);
+ ret = dsm_segment_address(seg);
+ if (init_callback)
+ (*init_callback) (ret);
+ }
+ PG_CATCH();
+ {
+ dshash_delete_entry(dsm_registry_table, entry);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
dsm_pin_segment(seg);
dsm_pin_mapping(seg);
state->handle = dsm_segment_handle(seg);
state->size = size;
- ret = dsm_segment_address(seg);
-
- if (init_callback)
- (*init_callback) (ret);
}
else if (entry->type != DSMR_ENTRY_TYPE_DSM)
ereport(ERROR,
@@ -297,12 +306,23 @@ GetNamedDSA(const char *name, bool *found)
NamedDSAState *state = &entry->dsa;
entry->type = DSMR_ENTRY_TYPE_DSA;
+ state->handle = DSA_HANDLE_INVALID;
/* Initialize the LWLock tranche for the DSA. */
- state->tranche = LWLockNewTrancheId(name);
+ PG_TRY();
+ {
+ state->tranche = LWLockNewTrancheId(name);
+ }
+ PG_CATCH();
+ {
+ dshash_delete_entry(dsm_registry_table, entry);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
/* Initialize the DSA. */
ret = dsa_create(state->tranche);
+
dsa_pin(ret);
dsa_pin_mapping(ret);
@@ -312,6 +332,19 @@ GetNamedDSA(const char *name, bool *found)
else if (entry->type != DSMR_ENTRY_TYPE_DSA)
ereport(ERROR,
(errmsg("requested DSA does not match type of existing entry")));
+ else if (entry->dsa.handle == DSA_HANDLE_INVALID)
+ {
+ NamedDSAState *state = &entry->dsa;
+
+ /* Initialize the DSA. */
+ ret = dsa_create(state->tranche);
+
+ dsa_pin(ret);
+ dsa_pin_mapping(ret);
+
+ /* Store handle for other backends to use. */
+ state->handle = dsa_get_handle(ret);
+ }
else
{
NamedDSAState *state = &entry->dsa;
@@ -372,20 +405,31 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
dsa_area *dsa;
entry->type = DSMR_ENTRY_TYPE_DSH;
+ dsh_state->dsa_handle = DSA_HANDLE_INVALID;
/* Initialize the LWLock tranche for the hash table. */
- dsh_state->tranche = LWLockNewTrancheId(name);
+ PG_TRY();
+ {
+ dsh_state->tranche = LWLockNewTrancheId(name);
+ }
+ PG_CATCH();
+ {
+ dshash_delete_entry(dsm_registry_table, entry);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
/* Initialize the DSA for the hash table. */
dsa = dsa_create(dsh_state->tranche);
- dsa_pin(dsa);
- dsa_pin_mapping(dsa);
/* Initialize the dshash table. */
memcpy(¶ms_copy, params, sizeof(dshash_parameters));
params_copy.tranche_id = dsh_state->tranche;
ret = dshash_create(dsa, ¶ms_copy, NULL);
+ dsa_pin(dsa);
+ dsa_pin_mapping(dsa);
+
/* Store handles for other backends to use. */
dsh_state->dsa_handle = dsa_get_handle(dsa);
dsh_state->dsh_handle = dshash_get_hash_table_handle(ret);
@@ -393,6 +437,27 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
else if (entry->type != DSMR_ENTRY_TYPE_DSH)
ereport(ERROR,
(errmsg("requested DSHash does not match type of existing entry")));
+ else if (entry->dsh.dsa_handle == DSA_HANDLE_INVALID)
+ {
+ NamedDSHState *dsh_state = &entry->dsh;
+ dshash_parameters params_copy;
+ dsa_area *dsa;
+
+ /* Initialize the DSA for the hash table. */
+ dsa = dsa_create(dsh_state->tranche);
+
+ /* Initialize the dshash table. */
+ memcpy(¶ms_copy, params, sizeof(dshash_parameters));
+ params_copy.tranche_id = dsh_state->tranche;
+ ret = dshash_create(dsa, ¶ms_copy, NULL);
+
+ dsa_pin(dsa);
+ dsa_pin_mapping(dsa);
+
+ /* Store handles for other backends to use. */
+ dsh_state->dsa_handle = dsa_get_handle(dsa);
+ dsh_state->dsh_handle = dshash_get_hash_table_handle(ret);
+ }
else
{
NamedDSHState *dsh_state = &entry->dsh;
--
2.39.5 (Apple Git-154)
On Mon, Nov 24, 2025 at 2:03 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
I think 0002 was more complicated than necessary. Here's a second try.
I haven't done a full review of this and I'm not sure whether you want
me to spend more time on it, but something I notice about this version
is that the PG_CATCH() blocks only contain dshash_delete_entry()
calls. That seems good, because that means that all the other cleanup
is being handled by transaction abort (assuming that the patch isn't
buggy, which I haven't attempted to verify). However, it does make me
wonder if we could also find a way to postpone adding the dshash
entries so that we don't need to do anything in PG_CATCH() blocks at
all. I'm guessing that the reason why that doesn't easily work is
because you're relying on those locks to prevent multiple backends
from doing the same initialization?
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Nov 24, 2025 at 04:02:18PM -0500, Robert Haas wrote:
I haven't done a full review of this and I'm not sure whether you want
me to spend more time on it...
I'd appreciate an eyeball check if you have the time.
I'm guessing that the reason why that doesn't easily work is
because you're relying on those locks to prevent multiple backends
from doing the same initialization?
For GetNamedDSMSegment(), I bet we could avoid needing a PG_CATCH by taking
the DSMRegistryLock exclusively when accessing the registry. But for
GetNamedDSA() and GetNamedDSHash(), we want to keep the half-initialized
entry so that we don't leak LWLock tranche IDs. I initially thought that
might be okay, but if every backend is retrying, you could quickly run out
of tranche IDs.
--
nathan
On Mon, Nov 24, 2025 at 4:25 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Nov 24, 2025 at 04:02:18PM -0500, Robert Haas wrote:
I haven't done a full review of this and I'm not sure whether you want
me to spend more time on it...I'd appreciate an eyeball check if you have the time.
OK. I am not too familiar with the current code but I will have a look.
I'm guessing that the reason why that doesn't easily work is
because you're relying on those locks to prevent multiple backends
from doing the same initialization?For GetNamedDSMSegment(), I bet we could avoid needing a PG_CATCH by taking
the DSMRegistryLock exclusively when accessing the registry. But for
GetNamedDSA() and GetNamedDSHash(), we want to keep the half-initialized
entry so that we don't leak LWLock tranche IDs. I initially thought that
might be okay, but if every backend is retrying, you could quickly run out
of tranche IDs.
I think that leaking tranche IDs is definitely a bad plan, even if it
didn't cause us to run out. Ideally, I think the design goal should be
to set up the structure of things so that we never need to do anything
beyond error cleanup if we fail. If we can't achieve that, well then
fine, but that is what I would view as most desirable. For instance, I
would be inclined to set up GetNamedDSA like this:
1. First or create create the dshash entry. If that fails, we haven't
done anything yet, so no problem.
2. Next, if there dshash entry doesn't have an assigned tranche ID
yet, then try to assign one. This probably won't fail. But if it does,
it will produce a meaningful error, and every future attempt will
likely produce the same, still-meaningful error. I think this is a
significant improvement over having every attempt but the first return
"requested DSA \"%s\" failed initialization," which hides the real
cause of failure.
3. Next, if the segment doesn't yet have a DSA, then try to create
one.This could fail due to lack of memory; if so, future attempts
might succeed, or might continue to fail with out-of-memory errors.
Assuming that creating the DSA is successful, pin the DSA and store
the handle into the dshash entry. On the other hand, if we don't
create a DSA here because one already exists, try to attach to it.
That could also fail, but if it does, we can still retry later.
4. Pin the mapping for the DSA that we created or attached to in the
previous step.
5. dshash_release_lock.
In other words, I'd try to create a system that always attempts to
make forward progress, but when that's not possible, fails without
permanently leaking any resources or spoiling anything for the next
backend that wants to make an attempt. One worry that we've mentioned
previously is that this might lead to every backend retrying. But as
the code is currently written, that basically happens anyway except
all of them but one produce a generic, uninformative error. Now, if we
do what I'm proposing here, we'll repeatedly try to create DSM, DSA,
and DSH objects in a way that actually allocates memory, so the
retries get a little bit more expensive. At the moment, I'm inclined
to believe this doesn't really matter. If it's true that failures are
very rare, and I suspect it is, then it doesn't really matter if the
cost of retries goes up a bit. If hypothetically they are common
enough to matter, then we definitely need a mechanism that can't get
permanently stuck in a half-initialized state. If we were to find that
retrying too many times too quickly creates other problems, my
suggested response would be to add a timestamp to the dshash entry and
limit retries to once per minute or something. However, in the absence
of evidence that we need such a mechanism, I'd be inclined to guess
that we don't.
Thoughts?
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Nov 25, 2025 at 10:37:50AM -0500, Robert Haas wrote:
1. First or create create the dshash entry. If that fails, we haven't
done anything yet, so no problem.2. Next, if there dshash entry doesn't have an assigned tranche ID
yet, then try to assign one. This probably won't fail. But if it does,
it will produce a meaningful error, and every future attempt will
likely produce the same, still-meaningful error. I think this is a
significant improvement over having every attempt but the first return
"requested DSA \"%s\" failed initialization," which hides the real
cause of failure.3. Next, if the segment doesn't yet have a DSA, then try to create
one.This could fail due to lack of memory; if so, future attempts
might succeed, or might continue to fail with out-of-memory errors.
Assuming that creating the DSA is successful, pin the DSA and store
the handle into the dshash entry. On the other hand, if we don't
create a DSA here because one already exists, try to attach to it.
That could also fail, but if it does, we can still retry later.4. Pin the mapping for the DSA that we created or attached to in the
previous step.5. dshash_release_lock.
I think the only difference between this and 0002 is in step 2. In 0002,
if allocating a tranche fails, we remove the entry from the registry so
that another backend can try again later. Otherwise, we retain the entry
with just the tranche ID set, even if later setup fails. I believe that
achieves basically the same thing, and is perhaps a tad simpler to code.
In other words, I'd try to create a system that always attempts to
make forward progress, but when that's not possible, fails without
permanently leaking any resources or spoiling anything for the next
backend that wants to make an attempt. One worry that we've mentioned
previously is that this might lead to every backend retrying. But as
the code is currently written, that basically happens anyway except
all of them but one produce a generic, uninformative error. Now, if we
do what I'm proposing here, we'll repeatedly try to create DSM, DSA,
and DSH objects in a way that actually allocates memory, so the
retries get a little bit more expensive. At the moment, I'm inclined
to believe this doesn't really matter. If it's true that failures are
very rare, and I suspect it is, then it doesn't really matter if the
cost of retries goes up a bit. If hypothetically they are common
enough to matter, then we definitely need a mechanism that can't get
permanently stuck in a half-initialized state. If we were to find that
retrying too many times too quickly creates other problems, my
suggested response would be to add a timestamp to the dshash entry and
limit retries to once per minute or something. However, in the absence
of evidence that we need such a mechanism, I'd be inclined to guess
that we don't.Thoughts?
Yeah, IMHO this is fine. I expect most of these ERRORs to be reached
during extension development or out-of-memory scenarios, and I don't see
how slightly more expensive retry logic would hurt anything there.
--
nathan
On Tue, Nov 25, 2025 at 10:50 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
I think the only difference between this and 0002 is in step 2. In 0002,
if allocating a tranche fails, we remove the entry from the registry so
that another backend can try again later. Otherwise, we retain the entry
with just the tranche ID set, even if later setup fails. I believe that
achieves basically the same thing, and is perhaps a tad simpler to code.
how slightly more expensive retry logic would hurt anything there.
The downside is that then we have to rely on PG_CATCH() to make things
whole. I am not sure that there's any problem with that, but I'm also
not sure that there isn't. The timing of PG_CATCH() block execution
often creates bugs, because it runs before (sub)transaction abort.
That means that there's a real risk that you try to acquire an LWLock
you already hold, for example. It's a lot easier to be confident that
cleanup actions will reliably succeed when they run inside the
transaction abort path that knows the order in which various resources
should be released. Generally, I think it's PG_CATCH() is fine if
you're releasing a resource that is decoupled from everything else,
like using a third-party library's free function to free memory
allocated by that library. But if you're releasing PostgreSQL
resources that are layered on top of other PostgreSQL resources, like
a DSA that depends on DSM and LWLock, I think it's a lot more
difficult to be certain that you aren't going to end up trying to
release the same stuff multiple times or in the wrong order. I'm not
saying you can't make it work, but I've banged my head on this
particular doorframe enough times that my reflex is to duck.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Nov 25, 2025 at 11:25:09AM -0500, Robert Haas wrote:
The downside is that then we have to rely on PG_CATCH() to make things
whole. I am not sure that there's any problem with that, but I'm also
not sure that there isn't. The timing of PG_CATCH() block execution
often creates bugs, because it runs before (sub)transaction abort.
That means that there's a real risk that you try to acquire an LWLock
you already hold, for example. It's a lot easier to be confident that
cleanup actions will reliably succeed when they run inside the
transaction abort path that knows the order in which various resources
should be released. Generally, I think it's PG_CATCH() is fine if
you're releasing a resource that is decoupled from everything else,
like using a third-party library's free function to free memory
allocated by that library. But if you're releasing PostgreSQL
resources that are layered on top of other PostgreSQL resources, like
a DSA that depends on DSM and LWLock, I think it's a lot more
difficult to be certain that you aren't going to end up trying to
release the same stuff multiple times or in the wrong order. I'm not
saying you can't make it work, but I've banged my head on this
particular doorframe enough times that my reflex is to duck.
I gave your idea a try, and I like the result a lot. IMHO it makes the
code much easier to reason about.
--
nathan
Attachments:
v3-0001-Revert-Teach-DSM-registry-to-ERROR-if-attaching-t.patchtext/plain; charset=us-asciiDownload
From 3fddb959db906ff6560d73c08ba7695812020275 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Nov 2025 11:03:52 -0600
Subject: [PATCH v3 1/2] Revert "Teach DSM registry to ERROR if attaching to an
uninitialized entry."
This reverts commit 1165a933aab1355757a43cfd9193b6cce06f573b.
---
src/backend/storage/ipc/dsm_registry.c | 34 +++-----------------------
1 file changed, 4 insertions(+), 30 deletions(-)
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c
index 7eba8a8cffb..a926b9c3f32 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -93,7 +93,6 @@ typedef struct DSMRegistryEntry
{
char name[NAMEDATALEN];
DSMREntryType type;
- bool initialized;
union
{
NamedDSMState dsm;
@@ -217,7 +216,6 @@ GetNamedDSMSegment(const char *name, size_t size,
dsm_segment *seg;
entry->type = DSMR_ENTRY_TYPE_DSM;
- entry->initialized = false;
/* Initialize the segment. */
seg = dsm_create(size, 0);
@@ -230,21 +228,13 @@ GetNamedDSMSegment(const char *name, size_t size,
if (init_callback)
(*init_callback) (ret);
-
- entry->initialized = true;
}
else if (entry->type != DSMR_ENTRY_TYPE_DSM)
ereport(ERROR,
- (errmsg("requested DSM segment \"%s\" does not match type of existing entry",
- name)));
- else if (!entry->initialized)
- ereport(ERROR,
- (errmsg("requested DSM segment \"%s\" failed initialization",
- name)));
+ (errmsg("requested DSM segment does not match type of existing entry")));
else if (entry->dsm.size != size)
ereport(ERROR,
- (errmsg("requested DSM segment \"%s\" does not match size of existing entry",
- name)));
+ (errmsg("requested DSM segment size does not match size of existing segment")));
else
{
NamedDSMState *state = &entry->dsm;
@@ -307,7 +297,6 @@ GetNamedDSA(const char *name, bool *found)
NamedDSAState *state = &entry->dsa;
entry->type = DSMR_ENTRY_TYPE_DSA;
- entry->initialized = false;
/* Initialize the LWLock tranche for the DSA. */
state->tranche = LWLockNewTrancheId(name);
@@ -319,17 +308,10 @@ GetNamedDSA(const char *name, bool *found)
/* Store handle for other backends to use. */
state->handle = dsa_get_handle(ret);
-
- entry->initialized = true;
}
else if (entry->type != DSMR_ENTRY_TYPE_DSA)
ereport(ERROR,
- (errmsg("requested DSA \"%s\" does not match type of existing entry",
- name)));
- else if (!entry->initialized)
- ereport(ERROR,
- (errmsg("requested DSA \"%s\" failed initialization",
- name)));
+ (errmsg("requested DSA does not match type of existing entry")));
else
{
NamedDSAState *state = &entry->dsa;
@@ -390,7 +372,6 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
dsa_area *dsa;
entry->type = DSMR_ENTRY_TYPE_DSH;
- entry->initialized = false;
/* Initialize the LWLock tranche for the hash table. */
dsh_state->tranche = LWLockNewTrancheId(name);
@@ -408,17 +389,10 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
/* Store handles for other backends to use. */
dsh_state->dsa_handle = dsa_get_handle(dsa);
dsh_state->dsh_handle = dshash_get_hash_table_handle(ret);
-
- entry->initialized = true;
}
else if (entry->type != DSMR_ENTRY_TYPE_DSH)
ereport(ERROR,
- (errmsg("requested DSHash \"%s\" does not match type of existing entry",
- name)));
- else if (!entry->initialized)
- ereport(ERROR,
- (errmsg("requested DSHash \"%s\" failed initialization",
- name)));
+ (errmsg("requested DSHash does not match type of existing entry")));
else
{
NamedDSHState *dsh_state = &entry->dsh;
--
2.39.5 (Apple Git-154)
v3-0002-handle-ERRORs-in-DSM-registry-functions.patchtext/plain; charset=us-asciiDownload
From c40884c967755b3630d0419a7d6585d3f4aae170 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Nov 2025 12:28:23 -0600
Subject: [PATCH v3 2/2] handle ERRORs in DSM registry functions
---
src/backend/storage/ipc/dsm_registry.c | 110 +++++++++++++++----------
1 file changed, 67 insertions(+), 43 deletions(-)
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c
index a926b9c3f32..5290cc1c8e3 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -155,9 +155,10 @@ init_dsm_registry(void)
{
/* Initialize dynamic shared hash table for registry. */
dsm_registry_dsa = dsa_create(LWTRANCHE_DSM_REGISTRY_DSA);
+ dsm_registry_table = dshash_create(dsm_registry_dsa, &dsh_params, NULL);
+
dsa_pin(dsm_registry_dsa);
dsa_pin_mapping(dsm_registry_dsa);
- dsm_registry_table = dshash_create(dsm_registry_dsa, &dsh_params, NULL);
/* Store handles in shared memory for other backends to use. */
DSMRegistryCtx->dsah = dsa_get_handle(dsm_registry_dsa);
@@ -188,6 +189,8 @@ GetNamedDSMSegment(const char *name, size_t size,
DSMRegistryEntry *entry;
MemoryContext oldcontext;
void *ret;
+ NamedDSMState *state;
+ dsm_segment *seg;
Assert(found);
@@ -210,36 +213,36 @@ GetNamedDSMSegment(const char *name, size_t size,
init_dsm_registry();
entry = dshash_find_or_insert(dsm_registry_table, name, found);
+ state = &entry->dsm;
if (!(*found))
{
- NamedDSMState *state = &entry->dsm;
- dsm_segment *seg;
-
entry->type = DSMR_ENTRY_TYPE_DSM;
+ state->handle = DSM_HANDLE_INVALID;
+ state->size = size;
+ }
+ else if (entry->type != DSMR_ENTRY_TYPE_DSM)
+ ereport(ERROR,
+ (errmsg("requested DSM segment does not match type of existing entry")));
+ else if (state->size != size)
+ ereport(ERROR,
+ (errmsg("requested DSM segment size does not match size of existing segment")));
+
+ if (state->handle == DSM_HANDLE_INVALID)
+ {
+ *found = false;
/* Initialize the segment. */
seg = dsm_create(size, 0);
+ if (init_callback)
+ (*init_callback) (dsm_segment_address(seg));
+
dsm_pin_segment(seg);
dsm_pin_mapping(seg);
state->handle = dsm_segment_handle(seg);
- state->size = size;
- ret = dsm_segment_address(seg);
-
- if (init_callback)
- (*init_callback) (ret);
}
- else if (entry->type != DSMR_ENTRY_TYPE_DSM)
- ereport(ERROR,
- (errmsg("requested DSM segment does not match type of existing entry")));
- else if (entry->dsm.size != size)
- ereport(ERROR,
- (errmsg("requested DSM segment size does not match size of existing segment")));
else
{
- NamedDSMState *state = &entry->dsm;
- dsm_segment *seg;
-
/* If the existing segment is not already attached, attach it now. */
seg = dsm_find_mapping(state->handle);
if (seg == NULL)
@@ -250,10 +253,9 @@ GetNamedDSMSegment(const char *name, size_t size,
dsm_pin_mapping(seg);
}
-
- ret = dsm_segment_address(seg);
}
+ ret = dsm_segment_address(seg);
dshash_release_lock(dsm_registry_table, entry);
MemoryContextSwitchTo(oldcontext);
@@ -274,6 +276,7 @@ GetNamedDSA(const char *name, bool *found)
DSMRegistryEntry *entry;
MemoryContext oldcontext;
dsa_area *ret;
+ NamedDSAState *state;
Assert(found);
@@ -292,14 +295,28 @@ GetNamedDSA(const char *name, bool *found)
init_dsm_registry();
entry = dshash_find_or_insert(dsm_registry_table, name, found);
+ state = &entry->dsa;
if (!(*found))
{
- NamedDSAState *state = &entry->dsa;
-
entry->type = DSMR_ENTRY_TYPE_DSA;
+ state->handle = DSA_HANDLE_INVALID;
+ state->tranche = -1;
+ }
+ else if (entry->type != DSMR_ENTRY_TYPE_DSA)
+ ereport(ERROR,
+ (errmsg("requested DSA does not match type of existing entry")));
+
+ if (state->tranche == -1)
+ {
+ *found = false;
/* Initialize the LWLock tranche for the DSA. */
state->tranche = LWLockNewTrancheId(name);
+ }
+
+ if (state->handle == DSA_HANDLE_INVALID)
+ {
+ *found = false;
/* Initialize the DSA. */
ret = dsa_create(state->tranche);
@@ -309,17 +326,11 @@ GetNamedDSA(const char *name, bool *found)
/* Store handle for other backends to use. */
state->handle = dsa_get_handle(ret);
}
- else if (entry->type != DSMR_ENTRY_TYPE_DSA)
+ else if (dsa_is_attached(state->handle))
ereport(ERROR,
- (errmsg("requested DSA does not match type of existing entry")));
+ (errmsg("requested DSA already attached to current process")));
else
{
- NamedDSAState *state = &entry->dsa;
-
- if (dsa_is_attached(state->handle))
- ereport(ERROR,
- (errmsg("requested DSA already attached to current process")));
-
/* Attach to existing DSA. */
ret = dsa_attach(state->handle);
dsa_pin_mapping(ret);
@@ -346,6 +357,7 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
DSMRegistryEntry *entry;
MemoryContext oldcontext;
dshash_table *ret;
+ NamedDSHState *dsh_state;
Assert(params);
Assert(found);
@@ -365,45 +377,57 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
init_dsm_registry();
entry = dshash_find_or_insert(dsm_registry_table, name, found);
+ dsh_state = &entry->dsh;
if (!(*found))
{
- NamedDSHState *dsh_state = &entry->dsh;
- dshash_parameters params_copy;
- dsa_area *dsa;
-
entry->type = DSMR_ENTRY_TYPE_DSH;
+ dsh_state->dsa_handle = DSA_HANDLE_INVALID;
+ dsh_state->dsh_handle = DSHASH_HANDLE_INVALID;
+ dsh_state->tranche = -1;
+ }
+ else if (entry->type != DSMR_ENTRY_TYPE_DSH)
+ ereport(ERROR,
+ (errmsg("requested DSHash does not match type of existing entry")));
+
+ if (dsh_state->tranche == -1)
+ {
+ *found = false;
/* Initialize the LWLock tranche for the hash table. */
dsh_state->tranche = LWLockNewTrancheId(name);
+ }
+
+ if (dsh_state->dsa_handle == DSA_HANDLE_INVALID)
+ {
+ dshash_parameters params_copy;
+ dsa_area *dsa;
+
+ *found = false;
/* Initialize the DSA for the hash table. */
dsa = dsa_create(dsh_state->tranche);
- dsa_pin(dsa);
- dsa_pin_mapping(dsa);
/* Initialize the dshash table. */
memcpy(¶ms_copy, params, sizeof(dshash_parameters));
params_copy.tranche_id = dsh_state->tranche;
ret = dshash_create(dsa, ¶ms_copy, NULL);
+ dsa_pin(dsa);
+ dsa_pin_mapping(dsa);
+
/* Store handles for other backends to use. */
dsh_state->dsa_handle = dsa_get_handle(dsa);
dsh_state->dsh_handle = dshash_get_hash_table_handle(ret);
}
- else if (entry->type != DSMR_ENTRY_TYPE_DSH)
+ else if (dsa_is_attached(dsh_state->dsa_handle))
ereport(ERROR,
- (errmsg("requested DSHash does not match type of existing entry")));
+ (errmsg("requested DSHash already attached to current process")));
else
{
- NamedDSHState *dsh_state = &entry->dsh;
dsa_area *dsa;
/* XXX: Should we verify params matches what table was created with? */
- if (dsa_is_attached(dsh_state->dsa_handle))
- ereport(ERROR,
- (errmsg("requested DSHash already attached to current process")));
-
/* Attach to existing DSA for the hash table. */
dsa = dsa_attach(dsh_state->dsa_handle);
dsa_pin_mapping(dsa);
--
2.39.5 (Apple Git-154)
On Tue, Nov 25, 2025 at 12:45 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
I gave your idea a try, and I like the result a lot. IMHO it makes the
code much easier to reason about.
I looked through this and I agree this looks good. Thanks for working on it.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Nov 25, 2025 at 03:58:47PM -0500, Robert Haas wrote:
On Tue, Nov 25, 2025 at 12:45 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:I gave your idea a try, and I like the result a lot. IMHO it makes the
code much easier to reason about.I looked through this and I agree this looks good. Thanks for working on it.
My tests seem happy, so I will plan on committing these patches tomorrow.
The only difference in v4 is that pg_get_dsm_registry_allocations() will no
longer show partially-initialized entries. I thought that was the best
option in this case because such entries shouldn't have any allocated
memory, and retrying GetNamed*() would set *found to false.
--
nathan
Attachments:
v4-0001-Revert-Teach-DSM-registry-to-ERROR-if-attaching-t.patchtext/plain; charset=us-asciiDownload
From 767081b66138df6038465094416502ee592e7e09 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Nov 2025 11:03:52 -0600
Subject: [PATCH v4 1/2] Revert "Teach DSM registry to ERROR if attaching to an
uninitialized entry."
This reverts commit 1165a933aab1355757a43cfd9193b6cce06f573b.
---
src/backend/storage/ipc/dsm_registry.c | 34 +++-----------------------
1 file changed, 4 insertions(+), 30 deletions(-)
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c
index 7eba8a8cffb..a926b9c3f32 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -93,7 +93,6 @@ typedef struct DSMRegistryEntry
{
char name[NAMEDATALEN];
DSMREntryType type;
- bool initialized;
union
{
NamedDSMState dsm;
@@ -217,7 +216,6 @@ GetNamedDSMSegment(const char *name, size_t size,
dsm_segment *seg;
entry->type = DSMR_ENTRY_TYPE_DSM;
- entry->initialized = false;
/* Initialize the segment. */
seg = dsm_create(size, 0);
@@ -230,21 +228,13 @@ GetNamedDSMSegment(const char *name, size_t size,
if (init_callback)
(*init_callback) (ret);
-
- entry->initialized = true;
}
else if (entry->type != DSMR_ENTRY_TYPE_DSM)
ereport(ERROR,
- (errmsg("requested DSM segment \"%s\" does not match type of existing entry",
- name)));
- else if (!entry->initialized)
- ereport(ERROR,
- (errmsg("requested DSM segment \"%s\" failed initialization",
- name)));
+ (errmsg("requested DSM segment does not match type of existing entry")));
else if (entry->dsm.size != size)
ereport(ERROR,
- (errmsg("requested DSM segment \"%s\" does not match size of existing entry",
- name)));
+ (errmsg("requested DSM segment size does not match size of existing segment")));
else
{
NamedDSMState *state = &entry->dsm;
@@ -307,7 +297,6 @@ GetNamedDSA(const char *name, bool *found)
NamedDSAState *state = &entry->dsa;
entry->type = DSMR_ENTRY_TYPE_DSA;
- entry->initialized = false;
/* Initialize the LWLock tranche for the DSA. */
state->tranche = LWLockNewTrancheId(name);
@@ -319,17 +308,10 @@ GetNamedDSA(const char *name, bool *found)
/* Store handle for other backends to use. */
state->handle = dsa_get_handle(ret);
-
- entry->initialized = true;
}
else if (entry->type != DSMR_ENTRY_TYPE_DSA)
ereport(ERROR,
- (errmsg("requested DSA \"%s\" does not match type of existing entry",
- name)));
- else if (!entry->initialized)
- ereport(ERROR,
- (errmsg("requested DSA \"%s\" failed initialization",
- name)));
+ (errmsg("requested DSA does not match type of existing entry")));
else
{
NamedDSAState *state = &entry->dsa;
@@ -390,7 +372,6 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
dsa_area *dsa;
entry->type = DSMR_ENTRY_TYPE_DSH;
- entry->initialized = false;
/* Initialize the LWLock tranche for the hash table. */
dsh_state->tranche = LWLockNewTrancheId(name);
@@ -408,17 +389,10 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
/* Store handles for other backends to use. */
dsh_state->dsa_handle = dsa_get_handle(dsa);
dsh_state->dsh_handle = dshash_get_hash_table_handle(ret);
-
- entry->initialized = true;
}
else if (entry->type != DSMR_ENTRY_TYPE_DSH)
ereport(ERROR,
- (errmsg("requested DSHash \"%s\" does not match type of existing entry",
- name)));
- else if (!entry->initialized)
- ereport(ERROR,
- (errmsg("requested DSHash \"%s\" failed initialization",
- name)));
+ (errmsg("requested DSHash does not match type of existing entry")));
else
{
NamedDSHState *dsh_state = &entry->dsh;
--
2.39.5 (Apple Git-154)
v4-0002-handle-ERRORs-in-DSM-registry-functions.patchtext/plain; charset=us-asciiDownload
From 07cac41a5b69d62ac0d5fa0646cc35c3210399ce Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Nov 2025 12:28:23 -0600
Subject: [PATCH v4 2/2] handle ERRORs in DSM registry functions
---
src/backend/storage/ipc/dsm_registry.c | 121 ++++++++++++++++---------
1 file changed, 78 insertions(+), 43 deletions(-)
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c
index a926b9c3f32..ef6533b1100 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -155,9 +155,10 @@ init_dsm_registry(void)
{
/* Initialize dynamic shared hash table for registry. */
dsm_registry_dsa = dsa_create(LWTRANCHE_DSM_REGISTRY_DSA);
+ dsm_registry_table = dshash_create(dsm_registry_dsa, &dsh_params, NULL);
+
dsa_pin(dsm_registry_dsa);
dsa_pin_mapping(dsm_registry_dsa);
- dsm_registry_table = dshash_create(dsm_registry_dsa, &dsh_params, NULL);
/* Store handles in shared memory for other backends to use. */
DSMRegistryCtx->dsah = dsa_get_handle(dsm_registry_dsa);
@@ -188,6 +189,8 @@ GetNamedDSMSegment(const char *name, size_t size,
DSMRegistryEntry *entry;
MemoryContext oldcontext;
void *ret;
+ NamedDSMState *state;
+ dsm_segment *seg;
Assert(found);
@@ -210,36 +213,36 @@ GetNamedDSMSegment(const char *name, size_t size,
init_dsm_registry();
entry = dshash_find_or_insert(dsm_registry_table, name, found);
+ state = &entry->dsm;
if (!(*found))
{
- NamedDSMState *state = &entry->dsm;
- dsm_segment *seg;
-
entry->type = DSMR_ENTRY_TYPE_DSM;
+ state->handle = DSM_HANDLE_INVALID;
+ state->size = size;
+ }
+ else if (entry->type != DSMR_ENTRY_TYPE_DSM)
+ ereport(ERROR,
+ (errmsg("requested DSM segment does not match type of existing entry")));
+ else if (state->size != size)
+ ereport(ERROR,
+ (errmsg("requested DSM segment size does not match size of existing segment")));
+
+ if (state->handle == DSM_HANDLE_INVALID)
+ {
+ *found = false;
/* Initialize the segment. */
seg = dsm_create(size, 0);
+ if (init_callback)
+ (*init_callback) (dsm_segment_address(seg));
+
dsm_pin_segment(seg);
dsm_pin_mapping(seg);
state->handle = dsm_segment_handle(seg);
- state->size = size;
- ret = dsm_segment_address(seg);
-
- if (init_callback)
- (*init_callback) (ret);
}
- else if (entry->type != DSMR_ENTRY_TYPE_DSM)
- ereport(ERROR,
- (errmsg("requested DSM segment does not match type of existing entry")));
- else if (entry->dsm.size != size)
- ereport(ERROR,
- (errmsg("requested DSM segment size does not match size of existing segment")));
else
{
- NamedDSMState *state = &entry->dsm;
- dsm_segment *seg;
-
/* If the existing segment is not already attached, attach it now. */
seg = dsm_find_mapping(state->handle);
if (seg == NULL)
@@ -250,10 +253,9 @@ GetNamedDSMSegment(const char *name, size_t size,
dsm_pin_mapping(seg);
}
-
- ret = dsm_segment_address(seg);
}
+ ret = dsm_segment_address(seg);
dshash_release_lock(dsm_registry_table, entry);
MemoryContextSwitchTo(oldcontext);
@@ -274,6 +276,7 @@ GetNamedDSA(const char *name, bool *found)
DSMRegistryEntry *entry;
MemoryContext oldcontext;
dsa_area *ret;
+ NamedDSAState *state;
Assert(found);
@@ -292,14 +295,28 @@ GetNamedDSA(const char *name, bool *found)
init_dsm_registry();
entry = dshash_find_or_insert(dsm_registry_table, name, found);
+ state = &entry->dsa;
if (!(*found))
{
- NamedDSAState *state = &entry->dsa;
-
entry->type = DSMR_ENTRY_TYPE_DSA;
+ state->handle = DSA_HANDLE_INVALID;
+ state->tranche = -1;
+ }
+ else if (entry->type != DSMR_ENTRY_TYPE_DSA)
+ ereport(ERROR,
+ (errmsg("requested DSA does not match type of existing entry")));
+
+ if (state->tranche == -1)
+ {
+ *found = false;
/* Initialize the LWLock tranche for the DSA. */
state->tranche = LWLockNewTrancheId(name);
+ }
+
+ if (state->handle == DSA_HANDLE_INVALID)
+ {
+ *found = false;
/* Initialize the DSA. */
ret = dsa_create(state->tranche);
@@ -309,17 +326,11 @@ GetNamedDSA(const char *name, bool *found)
/* Store handle for other backends to use. */
state->handle = dsa_get_handle(ret);
}
- else if (entry->type != DSMR_ENTRY_TYPE_DSA)
+ else if (dsa_is_attached(state->handle))
ereport(ERROR,
- (errmsg("requested DSA does not match type of existing entry")));
+ (errmsg("requested DSA already attached to current process")));
else
{
- NamedDSAState *state = &entry->dsa;
-
- if (dsa_is_attached(state->handle))
- ereport(ERROR,
- (errmsg("requested DSA already attached to current process")));
-
/* Attach to existing DSA. */
ret = dsa_attach(state->handle);
dsa_pin_mapping(ret);
@@ -346,6 +357,7 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
DSMRegistryEntry *entry;
MemoryContext oldcontext;
dshash_table *ret;
+ NamedDSHState *dsh_state;
Assert(params);
Assert(found);
@@ -365,45 +377,57 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
init_dsm_registry();
entry = dshash_find_or_insert(dsm_registry_table, name, found);
+ dsh_state = &entry->dsh;
if (!(*found))
{
- NamedDSHState *dsh_state = &entry->dsh;
- dshash_parameters params_copy;
- dsa_area *dsa;
-
entry->type = DSMR_ENTRY_TYPE_DSH;
+ dsh_state->dsa_handle = DSA_HANDLE_INVALID;
+ dsh_state->dsh_handle = DSHASH_HANDLE_INVALID;
+ dsh_state->tranche = -1;
+ }
+ else if (entry->type != DSMR_ENTRY_TYPE_DSH)
+ ereport(ERROR,
+ (errmsg("requested DSHash does not match type of existing entry")));
+
+ if (dsh_state->tranche == -1)
+ {
+ *found = false;
/* Initialize the LWLock tranche for the hash table. */
dsh_state->tranche = LWLockNewTrancheId(name);
+ }
+
+ if (dsh_state->dsa_handle == DSA_HANDLE_INVALID)
+ {
+ dshash_parameters params_copy;
+ dsa_area *dsa;
+
+ *found = false;
/* Initialize the DSA for the hash table. */
dsa = dsa_create(dsh_state->tranche);
- dsa_pin(dsa);
- dsa_pin_mapping(dsa);
/* Initialize the dshash table. */
memcpy(¶ms_copy, params, sizeof(dshash_parameters));
params_copy.tranche_id = dsh_state->tranche;
ret = dshash_create(dsa, ¶ms_copy, NULL);
+ dsa_pin(dsa);
+ dsa_pin_mapping(dsa);
+
/* Store handles for other backends to use. */
dsh_state->dsa_handle = dsa_get_handle(dsa);
dsh_state->dsh_handle = dshash_get_hash_table_handle(ret);
}
- else if (entry->type != DSMR_ENTRY_TYPE_DSH)
+ else if (dsa_is_attached(dsh_state->dsa_handle))
ereport(ERROR,
- (errmsg("requested DSHash does not match type of existing entry")));
+ (errmsg("requested DSHash already attached to current process")));
else
{
- NamedDSHState *dsh_state = &entry->dsh;
dsa_area *dsa;
/* XXX: Should we verify params matches what table was created with? */
- if (dsa_is_attached(dsh_state->dsa_handle))
- ereport(ERROR,
- (errmsg("requested DSHash already attached to current process")));
-
/* Attach to existing DSA for the hash table. */
dsa = dsa_attach(dsh_state->dsa_handle);
dsa_pin_mapping(dsa);
@@ -439,6 +463,17 @@ pg_get_dsm_registry_allocations(PG_FUNCTION_ARGS)
Datum vals[3];
bool nulls[3] = {0};
+ /* Do not show partially-initialized entries. */
+ if (entry->type == DSMR_ENTRY_TYPE_DSM &&
+ entry->dsm.handle == DSM_HANDLE_INVALID)
+ continue;
+ if (entry->type == DSMR_ENTRY_TYPE_DSA &&
+ entry->dsa.handle == DSA_HANDLE_INVALID)
+ continue;
+ if (entry->type == DSMR_ENTRY_TYPE_DSH &&
+ entry->dsh.dsa_handle == DSA_HANDLE_INVALID)
+ continue;
+
vals[0] = CStringGetTextDatum(entry->name);
vals[1] = CStringGetTextDatum(DSMREntryTypeNames[entry->type]);
--
2.39.5 (Apple Git-154)
On Tue, Nov 25, 2025 at 4:16 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
My tests seem happy, so I will plan on committing these patches tomorrow.
The only difference in v4 is that pg_get_dsm_registry_allocations() will no
longer show partially-initialized entries. I thought that was the best
option in this case because such entries shouldn't have any allocated
memory, and retrying GetNamed*() would set *found to false.
Without actually opening the new patch files, that seems OK to me. I
think the important thing about such a view is that it shouldn't make
any state changes; it should just show how things are. In my ideal
world, it would probably show partially-initialized entires in some
distinguishable way, like with a null size. But leaving them out is
also defensible. The thing that I think really matters for failure
cases is the ability to retry whatever's gone wrong, either to get it
fixed or to see the error, and the other changes already accomplish
that goal, so I'm happy.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Nov 25, 2025 at 07:13:03PM -0500, Robert Haas wrote:
On Tue, Nov 25, 2025 at 4:16 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
My tests seem happy, so I will plan on committing these patches tomorrow.
The only difference in v4 is that pg_get_dsm_registry_allocations() will no
longer show partially-initialized entries. I thought that was the best
option in this case because such entries shouldn't have any allocated
memory, and retrying GetNamed*() would set *found to false.Without actually opening the new patch files, that seems OK to me. I
think the important thing about such a view is that it shouldn't make
any state changes; it should just show how things are. In my ideal
world, it would probably show partially-initialized entires in some
distinguishable way, like with a null size. But leaving them out is
also defensible. The thing that I think really matters for failure
cases is the ability to retry whatever's gone wrong, either to get it
fixed or to see the error, and the other changes already accomplish
that goal, so I'm happy.
Committed, thanks for reviewing.
--
nathan