Fix crash when non-creator being an iteration on shared radix tree

Started by Masahiko Sawadaabout 1 year ago10 messages
#1Masahiko Sawada
sawada.mshk@gmail.com
1 attachment(s)

Hi,

I found that a server crashes due to a null-pointer-dereference if a
process attached to the shared radix tree begins an iteration on it,
because we don't create the memory context for iter_context at
RT_ATTACH(). There is no code in the core causing this crash in the
core since in parallel vacuum, the leader process always creates the
shared radix tree and begins the iteration. However it could happen in
external extensions. I've attached the patch to fix it and I think it
should be backpatched to v17.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-radixtree-Fix-crash-when-non-creator-begins-an-it.patchapplication/octet-stream; name=v1-0001-radixtree-Fix-crash-when-non-creator-begins-an-it.patchDownload
From 01d6ba83347279ea53ffc95dc000dd906a4d64f0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 16 Dec 2024 10:11:16 -0800
Subject: [PATCH v1] radixtree: Fix crash when non-creator begins an iteration
 on shared tree.

Previously, a radix tree can be created on shared memory to share
among with other processes but processes who attached to the shared
radix tree could not begin the iteration since they don't create the
iteration context in RT_ATTACH(). To avoid null-pointer-dereference
crash, this commit creates the iter_context in RT_ATTACH().

Backpatch to v17, where radixtree.h was introduced.

Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through: 17
---
 src/include/lib/radixtree.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 1301f3fee44..39fc56cb9fb 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1895,6 +1895,10 @@ RT_CREATE(MemoryContext ctx)
 }
 
 #ifdef RT_SHMEM
+/*
+ * Attach to an existing shared radix tree using a handle. The returned object
+ * is allocated in backend-local memory using the current memory context.
+ */
 RT_SCOPE	RT_RADIX_TREE *
 RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
 {
@@ -1902,6 +1906,7 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
 	dsa_pointer control;
 
 	tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+	tree->context = CurrentMemoryContext;
 
 	/* Find the control object in shared memory */
 	control = handle;
@@ -1910,13 +1915,28 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
 	tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
 	Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
 
+	/*
+	 * Create the iteration context so that the attached backend also can
+	 * begin the iteration.
+	 */
+	tree->iter_context = AllocSetContextCreate(CurrentMemoryContext,
+											   RT_STR(RT_PREFIX) "_radix_tree iter context",
+											   ALLOCSET_SMALL_SIZES);
+
 	return tree;
 }
 
+/*
+ * Detach from a shared radix tree. This frees backend-local resource associated
+ * with the radix tree, but the radix tree will continue to exist until it is
+ * either explicitly destroyed (by a backend that is still attached to it), or
+ * the area that backs it is returned to the operating system.
+ */
 RT_SCOPE void
 RT_DETACH(RT_RADIX_TREE * tree)
 {
 	Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+	MemoryContextDelete(tree->iter_context);
 	pfree(tree);
 }
 
-- 
2.43.5

#2John Naylor
johncnaylorls@gmail.com
In reply to: Masahiko Sawada (#1)
Re: Fix crash when non-creator being an iteration on shared radix tree

On Wed, Dec 18, 2024 at 12:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

I found that a server crashes due to a null-pointer-dereference if a
process attached to the shared radix tree begins an iteration on it,
because we don't create the memory context for iter_context at
RT_ATTACH(). There is no code in the core causing this crash in the
core since in parallel vacuum, the leader process always creates the
shared radix tree and begins the iteration. However it could happen in
external extensions. I've attached the patch to fix it and I think it
should be backpatched to v17.

+1 in general, but I wonder if instead the iter_context should be
created within RT_BEGIN_ITERATE -- I imagine that would have less
duplication and would be as safe, but I haven't tried it. Is there
some reason not to do that?

--
John Naylor
Amazon Web Services

#3Masahiko Sawada
sawada.mshk@gmail.com
In reply to: John Naylor (#2)
1 attachment(s)
Re: Fix crash when non-creator being an iteration on shared radix tree

On Tue, Dec 17, 2024 at 11:12 PM John Naylor <johncnaylorls@gmail.com> wrote:

On Wed, Dec 18, 2024 at 12:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

I found that a server crashes due to a null-pointer-dereference if a
process attached to the shared radix tree begins an iteration on it,
because we don't create the memory context for iter_context at
RT_ATTACH(). There is no code in the core causing this crash in the
core since in parallel vacuum, the leader process always creates the
shared radix tree and begins the iteration. However it could happen in
external extensions. I've attached the patch to fix it and I think it
should be backpatched to v17.

+1 in general, but I wonder if instead the iter_context should be
created within RT_BEGIN_ITERATE -- I imagine that would have less
duplication and would be as safe, but I haven't tried it. Is there
some reason not to do that?

I agree that it has less duplication. There is no strong reason I
didn't do that. I just didn't want to check 'if (!tree->iter_context)'
in RT_BEGIN_ITERATE for simplicity. I've changed the patch
accordingly.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-radixtree-Fix-crash-when-non-creator-begins-an-it.patchapplication/octet-stream; name=v2-0001-radixtree-Fix-crash-when-non-creator-begins-an-it.patchDownload
From ec47e433103099bb6c992e678b7ed925637a70d1 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 16 Dec 2024 10:11:16 -0800
Subject: [PATCH v2] radixtree: Fix crash when non-creator begins an iteration
 on shared tree.

Previously, a radix tree can be created on shared memory to share
among with other processes but processes who attached to the shared
radix tree could not begin the iteration since they don't create the
iteration context in RT_ATTACH(). It caused a server crash due to
null-pointer-dereference when a non-creator begins an iteration on
shared radix tree.

To fix it, this commit changes the iter_context to be created in
RT_BEGIN_ITERATE() in both shared and local cases.

Backpatch to v17, where radixtree.h was introduced.

Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAD21AoBB2U47V%3DF%2BwQRB1bERov_of5%3DBOZGaybjaV8FLQyqG3Q%40mail.gmail.com
Backpatch-through: 17
---
 src/include/lib/radixtree.h | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 1301f3fee44..2b866e6d51c 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1836,14 +1836,6 @@ RT_CREATE(MemoryContext ctx)
 	tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
 	tree->context = ctx;
 
-	/*
-	 * Separate context for iteration in case the tree context doesn't support
-	 * pfree
-	 */
-	tree->iter_context = AllocSetContextCreate(ctx,
-											   RT_STR(RT_PREFIX) "_radix_tree iter context",
-											   ALLOCSET_SMALL_SIZES);
-
 #ifdef RT_SHMEM
 	tree->dsa = dsa;
 	dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
@@ -1895,6 +1887,10 @@ RT_CREATE(MemoryContext ctx)
 }
 
 #ifdef RT_SHMEM
+/*
+ * Attach to an existing shared radix tree using a handle. The returned object
+ * is allocated in backend-local memory using the current memory context.
+ */
 RT_SCOPE	RT_RADIX_TREE *
 RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
 {
@@ -1902,6 +1898,7 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
 	dsa_pointer control;
 
 	tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+	tree->context = CurrentMemoryContext;
 
 	/* Find the control object in shared memory */
 	control = handle;
@@ -1913,10 +1910,18 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
 	return tree;
 }
 
+/*
+ * Detach from a shared radix tree. This frees backend-local resource associated
+ * with the radix tree, but the radix tree will continue to exist until it is
+ * either explicitly destroyed (by a backend that is still attached to it), or
+ * the area that backs it is returned to the operating system.
+ */
 RT_SCOPE void
 RT_DETACH(RT_RADIX_TREE * tree)
 {
 	Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+	if (tree->iter_context)
+		MemoryContextDelete(tree->iter_context);
 	pfree(tree);
 }
 
@@ -2086,6 +2091,15 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE * tree)
 	RT_ITER    *iter;
 	RT_CHILD_PTR root;
 
+	/*
+	 * Create a separate context for iteration in case the context doesn't
+	 * support pfree, if not created yet.
+	 */
+	if (tree->iter_context == NULL)
+		tree->iter_context = AllocSetContextCreate(tree->context,
+												   RT_STR(RT_PREFIX) "_radix_tree iter context",
+												   ALLOCSET_SMALL_SIZES);
+
 	iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
 											  sizeof(RT_ITER));
 	iter->tree = tree;
-- 
2.43.5

#4John Naylor
johncnaylorls@gmail.com
In reply to: Masahiko Sawada (#3)
Re: Fix crash when non-creator being an iteration on shared radix tree

On Thu, Dec 19, 2024 at 1:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 17, 2024 at 11:12 PM John Naylor <johncnaylorls@gmail.com> wrote:

+1 in general, but I wonder if instead the iter_context should be
created within RT_BEGIN_ITERATE -- I imagine that would have less
duplication and would be as safe, but I haven't tried it. Is there
some reason not to do that?

I agree that it has less duplication. There is no strong reason I
didn't do that. I just didn't want to check 'if (!tree->iter_context)'
in RT_BEGIN_ITERATE for simplicity. I've changed the patch
accordingly.

I see what you mean. For v17, a bit of duplication is probably worth
it for simplicity, so I'd say v1 is fine there.

However, I think on master we should reconsider some aspects of memory
management more broadly:

1. The creator allocates the root of the tree in a new child context,
but an attaching process allocates it in its current context, and we
pfree it when the caller wants to detach. It seems like we could
always allocate this small struct in CurrentMemoryContext for
consistency.

2. The iter_context is separate because the creator's new context
could be a bump context which doesn't support pfree. But above we
assume we can pfree in the caller's context. Also, IIUC we only
allocate small iter objects, and it'd be unusual to need more than one
at a time per backend, so it's a bit strange to have an entire context
for that. Since we use a standard pattern of "begin; while(iter);
end;", it seems unlikely that someone will cause a leak because of a
coding mistake in iteration.

If these tiny admin structs were always, not sometimes, in the callers
current context, I think it would be easier to reason about because
then the creator's passed context would be used only for local memory,
specifically only for leaves and the inner node child contexts.
Thoughts?

Further,

3. I was never a fan of trying to second-guess the creator's new
context and instead use slab for fixed-sized leaf allocations. If the
creator passes a bump context, we say "no, no, no, use slab -- it's
good for ya". Let's assume the caller knows what they're doing.

4. For local memory, an allocated "control object" serves no real
purpose and wastes a few cycles on every access. I'm not sure it
matters that much as a future micro-optimization, but I mention it
here because if we did start allocating outer structs in the callers
context, embedding would also remove the need to pfree it.

--
John Naylor
Amazon Web Services

#5Masahiko Sawada
sawada.mshk@gmail.com
In reply to: John Naylor (#4)
Re: Fix crash when non-creator being an iteration on shared radix tree

On Wed, Dec 18, 2024 at 10:32 PM John Naylor <johncnaylorls@gmail.com> wrote:

On Thu, Dec 19, 2024 at 1:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 17, 2024 at 11:12 PM John Naylor <johncnaylorls@gmail.com> wrote:

+1 in general, but I wonder if instead the iter_context should be
created within RT_BEGIN_ITERATE -- I imagine that would have less
duplication and would be as safe, but I haven't tried it. Is there
some reason not to do that?

I agree that it has less duplication. There is no strong reason I
didn't do that. I just didn't want to check 'if (!tree->iter_context)'
in RT_BEGIN_ITERATE for simplicity. I've changed the patch
accordingly.

I see what you mean. For v17, a bit of duplication is probably worth
it for simplicity, so I'd say v1 is fine there.

However, I think on master we should reconsider some aspects of memory
management more broadly:

1. The creator allocates the root of the tree in a new child context,
but an attaching process allocates it in its current context, and we
pfree it when the caller wants to detach. It seems like we could
always allocate this small struct in CurrentMemoryContext for
consistency.

2. The iter_context is separate because the creator's new context
could be a bump context which doesn't support pfree. But above we
assume we can pfree in the caller's context. Also, IIUC we only
allocate small iter objects, and it'd be unusual to need more than one
at a time per backend, so it's a bit strange to have an entire context
for that. Since we use a standard pattern of "begin; while(iter);
end;", it seems unlikely that someone will cause a leak because of a
coding mistake in iteration.

If these tiny admin structs were always, not sometimes, in the callers
current context, I think it would be easier to reason about because
then the creator's passed context would be used only for local memory,
specifically only for leaves and the inner node child contexts.
Thoughts?

Fair points. Given that we need only one iterator at a time per
backend, it would be simpler if the caller passes the pointer to an
iterator that is a stack variable to RT_BEGIN_ITEREATE(). For example,
TidStoreBeginIterate() would be like:

if (TidStoreIsShared(ts))
shared_ts_begin_iterate(ts->tree.shared, &iter->tree_iter.shared);
else
local_ts_begin_iterate(ts->tree.local, &iter->tree_iter.shared);

Further,

3. I was never a fan of trying to second-guess the creator's new
context and instead use slab for fixed-sized leaf allocations. If the
creator passes a bump context, we say "no, no, no, use slab -- it's
good for ya". Let's assume the caller knows what they're doing.

That's a valid argument but how can a user use the slab context for
leaf allocations? If the caller passes an allocset context to
RT_CREATE(), it still makes sense to usa slab context for leaf
allocation in terms of avoiding possible space wasting.

4. For local memory, an allocated "control object" serves no real
purpose and wastes a few cycles on every access. I'm not sure it
matters that much as a future micro-optimization, but I mention it
here because if we did start allocating outer structs in the callers
context, embedding would also remove the need to pfree it.

Using an allocated "control object" can simplify the codes for local
and shared trees. We cannot embed the control object into
RT_RADIX_TREE in shared cases. I agree to embed the control data if we
can implement that cleanly.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#6John Naylor
johncnaylorls@gmail.com
In reply to: Masahiko Sawada (#5)
3 attachment(s)
Re: Fix crash when non-creator being an iteration on shared radix tree

On Fri, Dec 20, 2024 at 4:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Dec 18, 2024 at 10:32 PM John Naylor <johncnaylorls@gmail.com> wrote:

2. The iter_context is separate because the creator's new context
could be a bump context which doesn't support pfree. But above we
assume we can pfree in the caller's context. Also, IIUC we only
allocate small iter objects, and it'd be unusual to need more than one
at a time per backend, so it's a bit strange to have an entire context
for that. Since we use a standard pattern of "begin; while(iter);
end;", it seems unlikely that someone will cause a leak because of a
coding mistake in iteration.

v3-0001 allocates the iter data in the caller's context. It's a small
patch, but still a core behavior change so proposed for master-only. I
believe your v1 is still the least invasive fix for PG17.

If these tiny admin structs were always, not sometimes, in the callers
current context, I think it would be easier to reason about because
then the creator's passed context would be used only for local memory,
specifically only for leaves and the inner node child contexts.

0002 does this.

Fair points. Given that we need only one iterator at a time per
backend, it would be simpler if the caller passes the pointer to an
iterator that is a stack variable to RT_BEGIN_ITEREATE(). For example,
TidStoreBeginIterate() would be like:

if (TidStoreIsShared(ts))
shared_ts_begin_iterate(ts->tree.shared, &iter->tree_iter.shared);
else
local_ts_begin_iterate(ts->tree.local, &iter->tree_iter.shared);

Hard for me to tell if it'd be simpler.

3. I was never a fan of trying to second-guess the creator's new
context and instead use slab for fixed-sized leaf allocations. If the
creator passes a bump context, we say "no, no, no, use slab -- it's
good for ya". Let's assume the caller knows what they're doing.

That's a valid argument but how can a user use the slab context for
leaf allocations?

It's trivial after 0001-02: 0003 removes makes one test use slab as
the passed context (only 32-bit systems would actually use it
currently).

Also, with a bit more work we could allow a NULL context for when the
caller has purposely arranged to use pointer-sized values. Did you see
any of Heikki's CSN patches? There is a radix tree used as a cache in
a context where the tree could be created and destroyed frequently.
Something about the memory blocks seems to have tickled some bad case
in the glibc allocator, and one less context might be good insurance:

/messages/by-id/718d1788-b058-40e6-bc37-8f15612b5646@iki.fi

--
John Naylor
Amazon Web Services

Attachments:

v3-0001-Use-caller-s-current-memory-context-for-radix-tre.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Use-caller-s-current-memory-context-for-radix-tre.patchDownload
From cd5f4e63dd7a49a0884249e55bf2b78293faa4b9 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Dec 2024 12:01:50 +0700
Subject: [PATCH v3 1/3] Use caller's current memory context for radix tree
 iteration

---
 src/include/lib/radixtree.h | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 1301f3fee4..8fe2302652 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -719,7 +719,6 @@ struct RT_RADIX_TREE
 	/* leaf_context is used only for single-value leaves */
 	MemoryContextData *leaf_context;
 #endif
-	MemoryContextData *iter_context;
 };
 
 /*
@@ -1836,14 +1835,6 @@ RT_CREATE(MemoryContext ctx)
 	tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
 	tree->context = ctx;
 
-	/*
-	 * Separate context for iteration in case the tree context doesn't support
-	 * pfree
-	 */
-	tree->iter_context = AllocSetContextCreate(ctx,
-											   RT_STR(RT_PREFIX) "_radix_tree iter context",
-											   ALLOCSET_SMALL_SIZES);
-
 #ifdef RT_SHMEM
 	tree->dsa = dsa;
 	dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
@@ -2075,7 +2066,8 @@ RT_FREE(RT_RADIX_TREE * tree)
 /***************** ITERATION *****************/
 
 /*
- * Create and return the iterator for the given radix tree.
+ * Create and return the iterator for the given radix tree,
+ * in the caller's current memory context.
  *
  * Taking a lock in shared mode during the iteration is the caller's
  * responsibility.
@@ -2086,8 +2078,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE * tree)
 	RT_ITER    *iter;
 	RT_CHILD_PTR root;
 
-	iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
-											  sizeof(RT_ITER));
+	iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
 	iter->tree = tree;
 
 	Assert(RT_PTR_ALLOC_IS_VALID(tree->ctl->root));
-- 
2.47.1

v3-0003-Always-use-the-caller-provided-context-for-radix-.patchtext/x-patch; charset=US-ASCII; name=v3-0003-Always-use-the-caller-provided-context-for-radix-.patchDownload
From 79d22b266f0df433dcb9147d0d64db53c681f6c7 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Dec 2024 14:48:24 +0700
Subject: [PATCH v3 3/3] Always use the caller-provided context for radix tree
 leaves

Previously, RT_CREATE would create an additional slab context
if the values were fixed-length and larger than pointer size.
Commit XXXXX arranged so that the caller-provided context is only used for
leaves and nothing else, so if we override that choice, that context
will go completely unused. That's confusing, and we have no reason to
believe we know better than the developer, so just use what the caller
provided.

As demonstration, use slab in one of the regression tests
---
 src/include/lib/radixtree.h                      | 14 --------------
 src/test/modules/test_radixtree/test_radixtree.c |  5 +++--
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 32f312305e..aa6354b2a4 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1849,21 +1849,7 @@ RT_CREATE(MemoryContext ctx)
 												size_class.allocsize);
 	}
 
-	/* By default we use the passed context for leaves. */
 	tree->leaf_context = ctx;
-
-#ifndef RT_VARLEN_VALUE_SIZE
-
-	/*
-	 * For leaves storing fixed-length values, we use a slab context to avoid
-	 * the possibility of space wastage by power-of-2 rounding up.
-	 */
-	if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
-		tree->leaf_context = SlabContextCreate(ctx,
-											   RT_STR(RT_PREFIX) "_radix_tree leaf context",
-											   RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
-											   sizeof(RT_VALUE_TYPE));
-#endif							/* !RT_VARLEN_VALUE_SIZE */
 #endif							/* RT_SHMEM */
 
 	/* add root node now so that RT_SET can assume it exists */
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 8074b83695..5c54962edf 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -313,9 +313,10 @@ test_random(void)
 #else
 	MemoryContext radixtree_ctx;
 
-	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+	radixtree_ctx = SlabContextCreate(CurrentMemoryContext,
 										  "test_radix_tree",
-										  ALLOCSET_SMALL_SIZES);
+										  SLAB_DEFAULT_BLOCK_SIZE,
+										  sizeof(TestValueType));
 	radixtree = rt_create(radixtree_ctx);
 #endif
 
-- 
2.47.1

v3-0002-Get-rid-of-radix-tree-s-general-purpose-context.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Get-rid-of-radix-tree-s-general-purpose-context.patchDownload
From 1a0bc58c91eaf007db385a12798e50afc514f7c9 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Dec 2024 13:04:18 +0700
Subject: [PATCH v3 2/3] Get rid of radix tree's general purpose context

Previously, this was only used for the root of the tree and,
for local memory, the control object. These can just as well
be allocated in the caller's current context.

This makes the memory context parameter from RT_CREATE for
shared memory unused, so remove it adjust callers to match.
---
 src/backend/access/common/tidstore.c          | 15 +++----
 src/include/lib/radixtree.h                   | 35 ++++++---------
 .../modules/test_radixtree/test_radixtree.c   | 45 ++++++++-----------
 3 files changed, 37 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index a7179759d6..786ed66b47 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -116,7 +116,7 @@ struct TidStore
 	/* MemoryContext where the TidStore is allocated */
 	MemoryContext context;
 
-	/* MemoryContext that the radix tree uses */
+	/* MemoryContext for the radix tree when using local memory, NULL for shared memory */
 	MemoryContext rt_context;
 
 	/* Storage for TIDs. Use either one depending on TidStoreIsShared() */
@@ -217,10 +217,6 @@ TidStoreCreateShared(size_t max_bytes, int tranche_id)
 	ts = palloc0(sizeof(TidStore));
 	ts->context = CurrentMemoryContext;
 
-	ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
-										   "TID storage meta data",
-										   ALLOCSET_SMALL_SIZES);
-
 	/*
 	 * Choose the initial and maximum DSA segment sizes to be no longer than
 	 * 1/8 of max_bytes.
@@ -235,8 +231,7 @@ TidStoreCreateShared(size_t max_bytes, int tranche_id)
 		dsa_init_size = dsa_max_size;
 
 	area = dsa_create_ext(tranche_id, dsa_init_size, dsa_max_size);
-	ts->tree.shared = shared_ts_create(ts->rt_context, area,
-									   tranche_id);
+	ts->tree.shared = shared_ts_create(area, tranche_id);
 	ts->area = area;
 
 	return ts;
@@ -328,13 +323,13 @@ TidStoreDestroy(TidStore *ts)
 	if (TidStoreIsShared(ts))
 	{
 		shared_ts_free(ts->tree.shared);
-
 		dsa_detach(ts->area);
 	}
 	else
+	{
 		local_ts_free(ts->tree.local);
-
-	MemoryContextDelete(ts->rt_context);
+		MemoryContextDelete(ts->rt_context);
+	}
 
 	pfree(ts);
 }
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 8fe2302652..32f312305e 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -275,7 +275,7 @@ typedef dsa_pointer RT_HANDLE;
 #endif
 
 #ifdef RT_SHMEM
-RT_SCOPE	RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE	RT_RADIX_TREE *RT_CREATE(dsa_area *dsa, int tranche_id);
 RT_SCOPE	RT_RADIX_TREE *RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
 RT_SCOPE void RT_DETACH(RT_RADIX_TREE * tree);
 RT_SCOPE	RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE * tree);
@@ -706,8 +706,6 @@ typedef struct RT_RADIX_TREE_CONTROL
 /* Entry point for allocating and accessing the tree */
 struct RT_RADIX_TREE
 {
-	MemoryContext context;
-
 	/* pointing to either local memory or DSA */
 	RT_RADIX_TREE_CONTROL *ctl;
 
@@ -1809,31 +1807,25 @@ have_slot:
 /***************** SETUP / TEARDOWN *****************/
 
 /*
- * Create the radix tree in the given memory context and return it.
+ * Create the radix tree root in the caller's current memory context and return it.
  *
- * All local memory required for a radix tree is allocated in the given
- * memory context and its children. Note that RT_FREE() will delete all
- * allocated space within the given memory context, so the dsa_area should
- * be created in a different context.
+ * The tree's nodes and leaves are allocated in "ctx" and its children for
+ * local memory, or in "dsa" for shared memory.
  */
 RT_SCOPE	RT_RADIX_TREE *
 #ifdef RT_SHMEM
-RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+RT_CREATE(dsa_area *dsa, int tranche_id)
 #else
 RT_CREATE(MemoryContext ctx)
 #endif
 {
 	RT_RADIX_TREE *tree;
-	MemoryContext old_ctx;
 	RT_CHILD_PTR rootnode;
 #ifdef RT_SHMEM
 	dsa_pointer dp;
 #endif
 
-	old_ctx = MemoryContextSwitchTo(ctx);
-
 	tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
-	tree->context = ctx;
 
 #ifdef RT_SHMEM
 	tree->dsa = dsa;
@@ -1858,7 +1850,7 @@ RT_CREATE(MemoryContext ctx)
 	}
 
 	/* By default we use the passed context for leaves. */
-	tree->leaf_context = tree->context;
+	tree->leaf_context = ctx;
 
 #ifndef RT_VARLEN_VALUE_SIZE
 
@@ -1880,8 +1872,6 @@ RT_CREATE(MemoryContext ctx)
 	tree->ctl->start_shift = 0;
 	tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(0);
 
-	MemoryContextSwitchTo(old_ctx);
-
 	return tree;
 }
 
@@ -2054,13 +2044,16 @@ RT_FREE(RT_RADIX_TREE * tree)
 	 */
 	tree->ctl->magic = 0;
 	dsa_free(tree->dsa, tree->ctl->handle);
-#endif
-
+#else
 	/*
-	 * Free all space allocated within the tree's context and delete all child
+	 * Free all space allocated within the leaf context and delete all child
 	 * contexts such as those used for nodes.
 	 */
-	MemoryContextReset(tree->context);
+	MemoryContextReset(tree->leaf_context);
+
+	pfree(tree->ctl);
+#endif
+	pfree(tree);
 }
 
 /***************** ITERATION *****************/
@@ -2674,7 +2667,7 @@ RT_MEMORY_USAGE(RT_RADIX_TREE * tree)
 	Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
 	total = dsa_get_total_size(tree->dsa);
 #else
-	total = MemoryContextMemAllocated(tree->context, true);
+	total = MemoryContextMemAllocated(tree->leaf_context, true);
 #endif
 
 	return total;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 3e5aa3720c..8074b83695 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -120,25 +120,22 @@ PG_FUNCTION_INFO_V1(test_radixtree);
 static void
 test_empty(void)
 {
-	MemoryContext radixtree_ctx;
 	rt_radix_tree *radixtree;
 	rt_iter    *iter;
 	uint64		key;
 #ifdef TEST_SHARED_RT
 	int			tranche_id = LWLockNewTrancheId();
 	dsa_area   *dsa;
-#endif
-
-	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
-										  "test_radix_tree",
-										  ALLOCSET_SMALL_SIZES);
 
-#ifdef TEST_SHARED_RT
 	LWLockRegisterTranche(tranche_id, "test_radix_tree");
 	dsa = dsa_create(tranche_id);
-
-	radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+	radixtree = rt_create(dsa, tranche_id);
 #else
+	MemoryContext radixtree_ctx;
+
+	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+										  "test_radix_tree",
+										  ALLOCSET_SMALL_SIZES);
 	radixtree = rt_create(radixtree_ctx);
 #endif
 
@@ -165,7 +162,6 @@ test_empty(void)
 static void
 test_basic(rt_node_class_test_elem *test_info, int shift, bool asc)
 {
-	MemoryContext radixtree_ctx;
 	rt_radix_tree *radixtree;
 	rt_iter    *iter;
 	uint64	   *keys;
@@ -173,18 +169,16 @@ test_basic(rt_node_class_test_elem *test_info, int shift, bool asc)
 #ifdef TEST_SHARED_RT
 	int			tranche_id = LWLockNewTrancheId();
 	dsa_area   *dsa;
-#endif
 
-	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
-										  "test_radix_tree",
-										  ALLOCSET_SMALL_SIZES);
-
-#ifdef TEST_SHARED_RT
 	LWLockRegisterTranche(tranche_id, "test_radix_tree");
 	dsa = dsa_create(tranche_id);
-
-	radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+	radixtree = rt_create(dsa, tranche_id);
 #else
+	MemoryContext radixtree_ctx;
+
+	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+										  "test_radix_tree",
+										  ALLOCSET_SMALL_SIZES);
 	radixtree = rt_create(radixtree_ctx);
 #endif
 
@@ -300,7 +294,6 @@ key_cmp(const void *a, const void *b)
 static void
 test_random(void)
 {
-	MemoryContext radixtree_ctx;
 	rt_radix_tree *radixtree;
 	rt_iter    *iter;
 	pg_prng_state state;
@@ -313,18 +306,16 @@ test_random(void)
 #ifdef TEST_SHARED_RT
 	int			tranche_id = LWLockNewTrancheId();
 	dsa_area   *dsa;
-#endif
 
-	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
-										  "test_radix_tree",
-										  ALLOCSET_SMALL_SIZES);
-
-#ifdef TEST_SHARED_RT
 	LWLockRegisterTranche(tranche_id, "test_radix_tree");
 	dsa = dsa_create(tranche_id);
-
-	radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+	radixtree = rt_create(dsa, tranche_id);
 #else
+	MemoryContext radixtree_ctx;
+
+	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+										  "test_radix_tree",
+										  ALLOCSET_SMALL_SIZES);
 	radixtree = rt_create(radixtree_ctx);
 #endif
 
-- 
2.47.1

#7Masahiko Sawada
sawada.mshk@gmail.com
In reply to: John Naylor (#6)
Re: Fix crash when non-creator being an iteration on shared radix tree

On Fri, Dec 20, 2024 at 2:27 AM John Naylor <johncnaylorls@gmail.com> wrote:

On Fri, Dec 20, 2024 at 4:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Dec 18, 2024 at 10:32 PM John Naylor <johncnaylorls@gmail.com> wrote:

2. The iter_context is separate because the creator's new context
could be a bump context which doesn't support pfree. But above we
assume we can pfree in the caller's context. Also, IIUC we only
allocate small iter objects, and it'd be unusual to need more than one
at a time per backend, so it's a bit strange to have an entire context
for that. Since we use a standard pattern of "begin; while(iter);
end;", it seems unlikely that someone will cause a leak because of a
coding mistake in iteration.

v3-0001 allocates the iter data in the caller's context. It's a small
patch, but still a core behavior change so proposed for master-only. I
believe your v1 is still the least invasive fix for PG17.

I agree to use v1 for v17.

If these tiny admin structs were always, not sometimes, in the callers
current context, I think it would be easier to reason about because
then the creator's passed context would be used only for local memory,
specifically only for leaves and the inner node child contexts.

0002 does this.

Fair points. Given that we need only one iterator at a time per
backend, it would be simpler if the caller passes the pointer to an
iterator that is a stack variable to RT_BEGIN_ITEREATE(). For example,
TidStoreBeginIterate() would be like:

if (TidStoreIsShared(ts))
shared_ts_begin_iterate(ts->tree.shared, &iter->tree_iter.shared);
else
local_ts_begin_iterate(ts->tree.local, &iter->tree_iter.shared);

Hard for me to tell if it'd be simpler.

3. I was never a fan of trying to second-guess the creator's new
context and instead use slab for fixed-sized leaf allocations. If the
creator passes a bump context, we say "no, no, no, use slab -- it's
good for ya". Let's assume the caller knows what they're doing.

That's a valid argument but how can a user use the slab context for
leaf allocations?

It's trivial after 0001-02: 0003 removes makes one test use slab as
the passed context (only 32-bit systems would actually use it
currently).

These changes make sense to me. Here are a few comments:

RT_RADIX_TREE has 'leaf_context' but it seems that we use it only for
local memory. Do we want to declare only in the !RT_SHMEM case?

---
/*
* Similar to TidStoreCreateLocal() but create a shared TidStore on a
* DSA area. The TID storage will live in the DSA area, and the memory
* context rt_context will have only meta data of the radix tree.
*
* The returned object is allocated in backend-local memory.
*/

We need to update the comment about rt_context too since we no longer
use rt_context for shared tidstore.

Also, with a bit more work we could allow a NULL context for when the
caller has purposely arranged to use pointer-sized values. Did you see
any of Heikki's CSN patches? There is a radix tree used as a cache in
a context where the tree could be created and destroyed frequently.
Something about the memory blocks seems to have tickled some bad case
in the glibc allocator, and one less context might be good insurance:

/messages/by-id/718d1788-b058-40e6-bc37-8f15612b5646@iki.fi

Will check these patches. Thanks!

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#8John Naylor
johncnaylorls@gmail.com
In reply to: Masahiko Sawada (#7)
3 attachment(s)
Re: Fix crash when non-creator being an iteration on shared radix tree

On Sat, Dec 21, 2024 at 2:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Dec 20, 2024 at 2:27 AM John Naylor <johncnaylorls@gmail.com> wrote:

v3-0001 allocates the iter data in the caller's context. It's a small
patch, but still a core behavior change so proposed for master-only. I
believe your v1 is still the least invasive fix for PG17.

I agree to use v1 for v17.

Okay, did you want to commit that separately, or together with my 0001
on master? Either way, I've put a bit more effort into the commit
message in v4 and adjusted the comment slightly.

It's trivial after 0001-02: 0003 removes makes one test use slab as
the passed context (only 32-bit systems would actually use it
currently).

These changes make sense to me. Here are a few comments:

RT_RADIX_TREE has 'leaf_context' but it seems that we use it only for
local memory. Do we want to declare only in the !RT_SHMEM case?

That's already the case, if I understand your statement correctly.

---
/*
* Similar to TidStoreCreateLocal() but create a shared TidStore on a
* DSA area. The TID storage will live in the DSA area, and the memory
* context rt_context will have only meta data of the radix tree.
*
* The returned object is allocated in backend-local memory.
*/

We need to update the comment about rt_context too since we no longer
use rt_context for shared tidstore.

Fixed. BTW, it seems TidStore.context is unused?

--
John Naylor
Amazon Web Services

Attachments:

v4-0003-Always-use-the-caller-provided-context-for-radix-.patchtext/x-patch; charset=US-ASCII; name=v4-0003-Always-use-the-caller-provided-context-for-radix-.patchDownload
From 679cf5c305b3f0affb5c4a679cf18eb0674e8691 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Dec 2024 14:48:24 +0700
Subject: [PATCH v4 3/3] Always use the caller-provided context for radix tree
 leaves

Previously, it may not have worked for a caller to pass a slab context,
since it would have been also used for other things which may have
incompatible size. In an attempt to helpfully avoid wasting space due
to aset's power-of-two rounding, RT_CREATE would create an additional
slab context if the values were fixed-length and larger than pointer
size. The problem was, we have since added the bump context type,
and the generation context was a possibility as well, so silently
overriding the caller's choice is not friendly.

Commit XXXXXXXXX arranged so that the caller-provided context is
only used for leaves and nothing else, so it's safe for the caller
to use slab here if they wish. As demonstration, use slab in one of
the radix tree regression tests.

Reviewed by Masahiko Sawada
---
 src/include/lib/radixtree.h                      | 14 --------------
 src/test/modules/test_radixtree/test_radixtree.c |  5 +++--
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 97aff227c5..fc7474e0c7 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1849,21 +1849,7 @@ RT_CREATE(MemoryContext ctx)
 												size_class.allocsize);
 	}
 
-	/* By default we use the passed context for leaves. */
 	tree->leaf_context = ctx;
-
-#ifndef RT_VARLEN_VALUE_SIZE
-
-	/*
-	 * For leaves storing fixed-length values, we use a slab context to avoid
-	 * the possibility of space wastage by power-of-2 rounding up.
-	 */
-	if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
-		tree->leaf_context = SlabContextCreate(ctx,
-											   RT_STR(RT_PREFIX) "_radix_tree leaf context",
-											   RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
-											   sizeof(RT_VALUE_TYPE));
-#endif							/* !RT_VARLEN_VALUE_SIZE */
 #endif							/* RT_SHMEM */
 
 	/* add root node now so that RT_SET can assume it exists */
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 8074b83695..5c54962edf 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -313,9 +313,10 @@ test_random(void)
 #else
 	MemoryContext radixtree_ctx;
 
-	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+	radixtree_ctx = SlabContextCreate(CurrentMemoryContext,
 										  "test_radix_tree",
-										  ALLOCSET_SMALL_SIZES);
+										  SLAB_DEFAULT_BLOCK_SIZE,
+										  sizeof(TestValueType));
 	radixtree = rt_create(radixtree_ctx);
 #endif
 
-- 
2.47.1

v4-0001-Get-rid-of-radix-tree-s-separate-iterator-context.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Get-rid-of-radix-tree-s-separate-iterator-context.patchDownload
From 46056af82a7516250b5cdb817579aeeb3952b8de Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Dec 2024 10:55:31 +0700
Subject: [PATCH v4 1/3] Get rid of radix tree's separate iterator context

Instead, just use the caller's memory context. This relieves backends
from having to create this context when they attach to a tree in
shared memory, and delete when they detach.

The iterator state is freed when ending iteration, and even if a
developer failed to follow existing examples of iteration, the state
struct is small enough to pose low risk.
---
 src/include/lib/radixtree.h | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 1301f3fee4..ea20365a23 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -719,7 +719,6 @@ struct RT_RADIX_TREE
 	/* leaf_context is used only for single-value leaves */
 	MemoryContextData *leaf_context;
 #endif
-	MemoryContextData *iter_context;
 };
 
 /*
@@ -1836,14 +1835,6 @@ RT_CREATE(MemoryContext ctx)
 	tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
 	tree->context = ctx;
 
-	/*
-	 * Separate context for iteration in case the tree context doesn't support
-	 * pfree
-	 */
-	tree->iter_context = AllocSetContextCreate(ctx,
-											   RT_STR(RT_PREFIX) "_radix_tree iter context",
-											   ALLOCSET_SMALL_SIZES);
-
 #ifdef RT_SHMEM
 	tree->dsa = dsa;
 	dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
@@ -2075,7 +2066,8 @@ RT_FREE(RT_RADIX_TREE * tree)
 /***************** ITERATION *****************/
 
 /*
- * Create and return the iterator for the given radix tree.
+ * Create and return an iterator for the given radix tree
+ * in the caller's memory context.
  *
  * Taking a lock in shared mode during the iteration is the caller's
  * responsibility.
@@ -2086,8 +2078,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE * tree)
 	RT_ITER    *iter;
 	RT_CHILD_PTR root;
 
-	iter = (RT_ITER *) MemoryContextAllocZero(tree->iter_context,
-											  sizeof(RT_ITER));
+	iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
 	iter->tree = tree;
 
 	Assert(RT_PTR_ALLOC_IS_VALID(tree->ctl->root));
-- 
2.47.1

v4-0002-Get-rid-of-radix-tree-s-general-purpose-memory-co.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Get-rid-of-radix-tree-s-general-purpose-memory-co.patchDownload
From 4ca82e1dd6a59f5536ed92319c1d59edf84ad12d Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Dec 2024 13:04:18 +0700
Subject: [PATCH v4 2/3] Get rid of radix tree's general purpose memory context

Previously, this was notionally used only for the entry point of
the tree and, for local memory, the (vestigial) control object. For
local memory, leaves and nodes use the same passed context, but it's
notionally different.

However, for shared memory, the creator allocated the entry point
in this context, but attaching backends didn't have access to this,
so just used the caller's context. For the sake of consistency, now
we allocate every instance of an entry point (and control object,
if not in DSA) in the caller's context.

For local memory, this makes the "leaf context" contain the child
contexts for nodes, so it's a bit of a misnomer, but we may someday
want to make node contexts independent rather than children, so leave
it this way for now to avoid code churn.

The memory context parameter for RT_CREATE is now unused in the case
of shared memory, so remove it and adjust callers to match.

Reviewed by Masahiko Sawada
---
 src/backend/access/common/tidstore.c          | 21 ++++-----
 src/include/lib/radixtree.h                   | 35 ++++++---------
 .../modules/test_radixtree/test_radixtree.c   | 45 ++++++++-----------
 3 files changed, 41 insertions(+), 60 deletions(-)

diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index a7179759d6..6192ffd7ba 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -116,7 +116,10 @@ struct TidStore
 	/* MemoryContext where the TidStore is allocated */
 	MemoryContext context;
 
-	/* MemoryContext that the radix tree uses */
+	/*
+	 * MemoryContext for the radix tree when using local memory, NULL for
+	 * shared memory
+	 */
 	MemoryContext rt_context;
 
 	/* Storage for TIDs. Use either one depending on TidStoreIsShared() */
@@ -201,8 +204,7 @@ TidStoreCreateLocal(size_t max_bytes, bool insert_only)
 
 /*
  * Similar to TidStoreCreateLocal() but create a shared TidStore on a
- * DSA area. The TID storage will live in the DSA area, and the memory
- * context rt_context will have only meta data of the radix tree.
+ * DSA area.
  *
  * The returned object is allocated in backend-local memory.
  */
@@ -217,10 +219,6 @@ TidStoreCreateShared(size_t max_bytes, int tranche_id)
 	ts = palloc0(sizeof(TidStore));
 	ts->context = CurrentMemoryContext;
 
-	ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
-										   "TID storage meta data",
-										   ALLOCSET_SMALL_SIZES);
-
 	/*
 	 * Choose the initial and maximum DSA segment sizes to be no longer than
 	 * 1/8 of max_bytes.
@@ -235,8 +233,7 @@ TidStoreCreateShared(size_t max_bytes, int tranche_id)
 		dsa_init_size = dsa_max_size;
 
 	area = dsa_create_ext(tranche_id, dsa_init_size, dsa_max_size);
-	ts->tree.shared = shared_ts_create(ts->rt_context, area,
-									   tranche_id);
+	ts->tree.shared = shared_ts_create(area, tranche_id);
 	ts->area = area;
 
 	return ts;
@@ -328,13 +325,13 @@ TidStoreDestroy(TidStore *ts)
 	if (TidStoreIsShared(ts))
 	{
 		shared_ts_free(ts->tree.shared);
-
 		dsa_detach(ts->area);
 	}
 	else
+	{
 		local_ts_free(ts->tree.local);
-
-	MemoryContextDelete(ts->rt_context);
+		MemoryContextDelete(ts->rt_context);
+	}
 
 	pfree(ts);
 }
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index ea20365a23..97aff227c5 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -275,7 +275,7 @@ typedef dsa_pointer RT_HANDLE;
 #endif
 
 #ifdef RT_SHMEM
-RT_SCOPE	RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE	RT_RADIX_TREE *RT_CREATE(dsa_area *dsa, int tranche_id);
 RT_SCOPE	RT_RADIX_TREE *RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
 RT_SCOPE void RT_DETACH(RT_RADIX_TREE * tree);
 RT_SCOPE	RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE * tree);
@@ -706,8 +706,6 @@ typedef struct RT_RADIX_TREE_CONTROL
 /* Entry point for allocating and accessing the tree */
 struct RT_RADIX_TREE
 {
-	MemoryContext context;
-
 	/* pointing to either local memory or DSA */
 	RT_RADIX_TREE_CONTROL *ctl;
 
@@ -1809,31 +1807,25 @@ have_slot:
 /***************** SETUP / TEARDOWN *****************/
 
 /*
- * Create the radix tree in the given memory context and return it.
+ * Create the radix tree root in the caller's current memory context and return it.
  *
- * All local memory required for a radix tree is allocated in the given
- * memory context and its children. Note that RT_FREE() will delete all
- * allocated space within the given memory context, so the dsa_area should
- * be created in a different context.
+ * The tree's nodes and leaves are allocated in "ctx" and its children for
+ * local memory, or in "dsa" for shared memory.
  */
 RT_SCOPE	RT_RADIX_TREE *
 #ifdef RT_SHMEM
-RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+RT_CREATE(dsa_area *dsa, int tranche_id)
 #else
 RT_CREATE(MemoryContext ctx)
 #endif
 {
 	RT_RADIX_TREE *tree;
-	MemoryContext old_ctx;
 	RT_CHILD_PTR rootnode;
 #ifdef RT_SHMEM
 	dsa_pointer dp;
 #endif
 
-	old_ctx = MemoryContextSwitchTo(ctx);
-
 	tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
-	tree->context = ctx;
 
 #ifdef RT_SHMEM
 	tree->dsa = dsa;
@@ -1858,7 +1850,7 @@ RT_CREATE(MemoryContext ctx)
 	}
 
 	/* By default we use the passed context for leaves. */
-	tree->leaf_context = tree->context;
+	tree->leaf_context = ctx;
 
 #ifndef RT_VARLEN_VALUE_SIZE
 
@@ -1880,8 +1872,6 @@ RT_CREATE(MemoryContext ctx)
 	tree->ctl->start_shift = 0;
 	tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(0);
 
-	MemoryContextSwitchTo(old_ctx);
-
 	return tree;
 }
 
@@ -2054,13 +2044,16 @@ RT_FREE(RT_RADIX_TREE * tree)
 	 */
 	tree->ctl->magic = 0;
 	dsa_free(tree->dsa, tree->ctl->handle);
-#endif
-
+#else
 	/*
-	 * Free all space allocated within the tree's context and delete all child
+	 * Free all space allocated within the leaf context and delete all child
 	 * contexts such as those used for nodes.
 	 */
-	MemoryContextReset(tree->context);
+	MemoryContextReset(tree->leaf_context);
+
+	pfree(tree->ctl);
+#endif
+	pfree(tree);
 }
 
 /***************** ITERATION *****************/
@@ -2674,7 +2667,7 @@ RT_MEMORY_USAGE(RT_RADIX_TREE * tree)
 	Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
 	total = dsa_get_total_size(tree->dsa);
 #else
-	total = MemoryContextMemAllocated(tree->context, true);
+	total = MemoryContextMemAllocated(tree->leaf_context, true);
 #endif
 
 	return total;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 3e5aa3720c..8074b83695 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -120,25 +120,22 @@ PG_FUNCTION_INFO_V1(test_radixtree);
 static void
 test_empty(void)
 {
-	MemoryContext radixtree_ctx;
 	rt_radix_tree *radixtree;
 	rt_iter    *iter;
 	uint64		key;
 #ifdef TEST_SHARED_RT
 	int			tranche_id = LWLockNewTrancheId();
 	dsa_area   *dsa;
-#endif
-
-	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
-										  "test_radix_tree",
-										  ALLOCSET_SMALL_SIZES);
 
-#ifdef TEST_SHARED_RT
 	LWLockRegisterTranche(tranche_id, "test_radix_tree");
 	dsa = dsa_create(tranche_id);
-
-	radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+	radixtree = rt_create(dsa, tranche_id);
 #else
+	MemoryContext radixtree_ctx;
+
+	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+										  "test_radix_tree",
+										  ALLOCSET_SMALL_SIZES);
 	radixtree = rt_create(radixtree_ctx);
 #endif
 
@@ -165,7 +162,6 @@ test_empty(void)
 static void
 test_basic(rt_node_class_test_elem *test_info, int shift, bool asc)
 {
-	MemoryContext radixtree_ctx;
 	rt_radix_tree *radixtree;
 	rt_iter    *iter;
 	uint64	   *keys;
@@ -173,18 +169,16 @@ test_basic(rt_node_class_test_elem *test_info, int shift, bool asc)
 #ifdef TEST_SHARED_RT
 	int			tranche_id = LWLockNewTrancheId();
 	dsa_area   *dsa;
-#endif
 
-	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
-										  "test_radix_tree",
-										  ALLOCSET_SMALL_SIZES);
-
-#ifdef TEST_SHARED_RT
 	LWLockRegisterTranche(tranche_id, "test_radix_tree");
 	dsa = dsa_create(tranche_id);
-
-	radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+	radixtree = rt_create(dsa, tranche_id);
 #else
+	MemoryContext radixtree_ctx;
+
+	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+										  "test_radix_tree",
+										  ALLOCSET_SMALL_SIZES);
 	radixtree = rt_create(radixtree_ctx);
 #endif
 
@@ -300,7 +294,6 @@ key_cmp(const void *a, const void *b)
 static void
 test_random(void)
 {
-	MemoryContext radixtree_ctx;
 	rt_radix_tree *radixtree;
 	rt_iter    *iter;
 	pg_prng_state state;
@@ -313,18 +306,16 @@ test_random(void)
 #ifdef TEST_SHARED_RT
 	int			tranche_id = LWLockNewTrancheId();
 	dsa_area   *dsa;
-#endif
 
-	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
-										  "test_radix_tree",
-										  ALLOCSET_SMALL_SIZES);
-
-#ifdef TEST_SHARED_RT
 	LWLockRegisterTranche(tranche_id, "test_radix_tree");
 	dsa = dsa_create(tranche_id);
-
-	radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+	radixtree = rt_create(dsa, tranche_id);
 #else
+	MemoryContext radixtree_ctx;
+
+	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+										  "test_radix_tree",
+										  ALLOCSET_SMALL_SIZES);
 	radixtree = rt_create(radixtree_ctx);
 #endif
 
-- 
2.47.1

#9Masahiko Sawada
sawada.mshk@gmail.com
In reply to: John Naylor (#8)
Re: Fix crash when non-creator being an iteration on shared radix tree

On Fri, Dec 20, 2024 at 9:55 PM John Naylor <johncnaylorls@gmail.com> wrote:

On Sat, Dec 21, 2024 at 2:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Dec 20, 2024 at 2:27 AM John Naylor <johncnaylorls@gmail.com> wrote:

v3-0001 allocates the iter data in the caller's context. It's a small
patch, but still a core behavior change so proposed for master-only. I
believe your v1 is still the least invasive fix for PG17.

I agree to use v1 for v17.

Okay, did you want to commit that separately, or together with my 0001
on master? Either way, I've put a bit more effort into the commit
message in v4 and adjusted the comment slightly.

I think we can commit them separately. It's a pure bug fix for back
branches while for HEAD it removes the bug by an improvement.

It's trivial after 0001-02: 0003 removes makes one test use slab as
the passed context (only 32-bit systems would actually use it
currently).

These changes make sense to me. Here are a few comments:

RT_RADIX_TREE has 'leaf_context' but it seems that we use it only for
local memory. Do we want to declare only in the !RT_SHMEM case?

That's already the case, if I understand your statement correctly.

True. Sorry, I misunderstood something.

---
/*
* Similar to TidStoreCreateLocal() but create a shared TidStore on a
* DSA area. The TID storage will live in the DSA area, and the memory
* context rt_context will have only meta data of the radix tree.
*
* The returned object is allocated in backend-local memory.
*/

We need to update the comment about rt_context too since we no longer
use rt_context for shared tidstore.

Fixed.

The three patches look good to me.

BTW, it seems TidStore.context is unused?

Indeed. We can remove it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#10John Naylor
johncnaylorls@gmail.com
In reply to: Masahiko Sawada (#9)
Re: Fix crash when non-creator being an iteration on shared radix tree

On Thu, Dec 26, 2024 at 11:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The three patches look good to me.

Thanks for looking! I've pushed them all.

(The failure in drongo seems like an unrelated glitch.)

4. For local memory, an allocated "control object" serves no real
purpose and wastes a few cycles on every access. I'm not sure it
matters that much as a future micro-optimization, but I mention it
here because if we did start allocating outer structs in the callers
context, embedding would also remove the need to pfree it.

Using an allocated "control object" can simplify the codes for local
and shared trees. We cannot embed the control object into
RT_RADIX_TREE in shared cases. I agree to embed the control data if we
can implement that cleanly.

I tried this and it was fairly trivial to get working, but it didn't
have the intended effect so I'll leave it alone.

--
John Naylor
Amazon Web Services